Session-aware and thread-safe CUDA profiler by abudup · Pull Request #13706 · microsoft/onnxruntime

abudup · 2022-11-19T01:10:36Z

Description

The existing CUDA profiler is neither session-aware, nor thread-safe. This PR ensures both.

Motivation and Context

PR 13549 brought thread-safety and session-awareness to the ROCm profiler. This PR brings the same goodness to the CUDA profiler as well.

Sample outputs of a profiling run from the StableDiffusion model (this model was chosen because it requires orchestration of multiple sessions, and verifies that the profilers are now indeed session-aware) on both CUDA and ROCm EPs are attached, along with a script that checks that the trace files generated by the profile are well-formed.

Update 11/29: Updated the profile outputs. The older profile outputs exhibited an issue where some timestamps were wildly out of range, leading to problems visualizing the traces. The bug has been fixed and the profile outputs have been updated, along with an update to the check script to ensure that timestamps are monotonically increasing.

sd_profile_outputs_cuda.tar.gz
sd_profile_outputs_rocm.tar.gz
check_profile_output_well_formedness.zip

onnxruntime/core/providers/cuda/cupti_manager.cc

onnxruntime/core/providers/shared_library/provider_api.h

include/onnxruntime/core/common/gpu_profiler_common.h

onnxruntime/core/providers/cuda/cuda_profiler.cc

onnxruntime/core/providers/cuda/cupti_manager.cc

include/onnxruntime/core/common/gpu_profiler_common.h

…U profilers

cloudhan · 2022-11-30T03:18:29Z

include/onnxruntime/core/common/gpu_profiler_common.h

+      std::string parent_name;

+      // Tracers may not use Jan 1 1970 as an epoch for timestamps.
+      // So, we need to adjust the timestamp to something sensible.


I am not sure how this works actually, can you elaborate a bit?

Sure. This code is essentially trying to account for the fact that the timestamps reported by tracers (specifically, the ROCm tracer) for the beginning and end of a kernel are not guaranteed to use the "standard" epoch of Jan 1 1970. I believe the ROCm tracer uses system boot time as the epoch for timestamps (but I couldn't find this documented anywhere). Now, when we subtract the profiling start time (which is the number of ticks from Jan 1 1970), we end up with a large negative value, which messes up the visualization in a trace viewer as you observed. This code essentially tries to work around this behavior by ignoring the timestamps from tracers, and filling in something that is logical.

Note that this does not affect the durations of events: those are calculated simply as duration = end_time_ns - start_time_ns. It only affects the start timestamp. I'd argue that despite the inaccuracy, the start time is closer to the actual start time than if we simply used the tracer timestamps.

cloudhan · 2022-12-07T03:53:48Z

include/onnxruntime/core/common/gpu_profiler_common.h

+          evt.args["parent_name"] = parent_name;
+        }
+        evt.ts = last_ts + increment;
+        increment += evt.dur;


Doesn't this should be increment += <curr_event_ts> - <prev_event_ts> somehow? Otherwise, it makes all children events appears to be back to back, that is a critical skew in the trace view and will give the profiler users some illusion about the event timing.

Alternatively, is it viable just substract the timestamp diff between CPU events' time epoch and GPU events' time epoch? That is, we convert to unify them directly, instead of compute and try to reconstruct?

Doesn't this should be increment += <curr_event_ts> - <prev_event_ts> somehow? Otherwise, it makes all children events appears to be back to back, that is a critical skew in the trace view and will give the profiler users some illusion about the event timing.

Thanks! This is a much neater way of reconciling the two time streams: we preserve the relative timing between kernel launches. I've implemented a similar idea, by approximating the offset of the first kernel event, and then using that offset to adjust the subsequent kernel events.

As for the alternative approach you suggested: that's challenging, primarily because I could not find documentation about the epoch used by roctracer. I'm presuming that it's system boot time, but I could be wrong.

cloudhan · 2022-12-07T03:56:02Z

include/onnxruntime/core/common/gpu_profiler_common.h

+      while (event_iter != event_end &&
+             (event_iter->ts < ts ||
+              (event_iter->ts == ts &&
+               (event_iter + 1) != event_end &&
+               (event_iter + 1)->ts == ts))) {
+        merged_events.emplace_back(*event_iter);
+        ++event_iter;
+      }


It seems that you are doing a linear scan here, does events guaranteed to be sorted by timestamp?

From my understanding of the code, events is sorted by timestamp. That said, I've only refactored existing code, which assumed that events is sorted in non-decreasing order.

### Description The existing CUDA profiler is neither session-aware, nor thread-safe. This PR ensures both. ### Motivation and Context [PR 13549](#13549) brought thread-safety and session-awareness to the ROCm profiler. This PR brings the same goodness to the CUDA profiler as well. Sample outputs of a profiling run from the StableDiffusion model (this model was chosen because it requires orchestration of multiple sessions, and verifies that the profilers are now indeed session-aware) on both CUDA and ROCm EPs are attached, along with a script that checks that the trace files generated by the profile are well-formed. Update 11/29: Updated the profile outputs. The older profile outputs exhibited an issue where some timestamps were wildly out of range, leading to problems visualizing the traces. The bug has been fixed and the profile outputs have been updated, along with an update to the check script to ensure that timestamps are monotonically increasing. [sd_profile_outputs_cuda.tar.gz](https://github.com/microsoft/onnxruntime/files/10118088/sd_profile_outputs_cuda.tar.gz) [sd_profile_outputs_rocm.tar.gz](https://github.com/microsoft/onnxruntime/files/10118089/sd_profile_outputs_rocm.tar.gz) [check_profile_output_well_formedness.zip](https://github.com/microsoft/onnxruntime/files/10118090/check_profile_output_well_formedness.zip) Co-authored-by: Abhishek Udupa <abhishek.udupa@microsoft.com>

### Description The existing CUDA profiler is neither session-aware, nor thread-safe. This PR ensures both. ### Motivation and Context [PR 13549](microsoft#13549) brought thread-safety and session-awareness to the ROCm profiler. This PR brings the same goodness to the CUDA profiler as well. Sample outputs of a profiling run from the StableDiffusion model (this model was chosen because it requires orchestration of multiple sessions, and verifies that the profilers are now indeed session-aware) on both CUDA and ROCm EPs are attached, along with a script that checks that the trace files generated by the profile are well-formed. Update 11/29: Updated the profile outputs. The older profile outputs exhibited an issue where some timestamps were wildly out of range, leading to problems visualizing the traces. The bug has been fixed and the profile outputs have been updated, along with an update to the check script to ensure that timestamps are monotonically increasing. [sd_profile_outputs_cuda.tar.gz](https://github.com/microsoft/onnxruntime/files/10118088/sd_profile_outputs_cuda.tar.gz) [sd_profile_outputs_rocm.tar.gz](https://github.com/microsoft/onnxruntime/files/10118089/sd_profile_outputs_rocm.tar.gz) [check_profile_output_well_formedness.zip](https://github.com/microsoft/onnxruntime/files/10118090/check_profile_output_well_formedness.zip) Co-authored-by: Abhishek Udupa <abhishek.udupa@microsoft.com>

@edgchen1

### Description #### TLDR This PR ports the existing C++ [EpProfiler](https://github.com/microsoft/onnxruntime/blob/faad20f9d3264c7f3b6d4e4398990e13ee864512/include/onnxruntime/core/framework/execution_provider.h#L359) interfaces used by provider-bridge EPs to the binary-stable C APIs for plugin EPs. It introduces C/C++ APIs for creating/querying profiling events, a container for appending EP events, and callback hooks (`StartEvent`/`StopEvent`) that give EPs access to ORT event metadata in real-time. #### Changes to the original C++ API The original `EpProfiler` C++ interface was adapted for the C API with the following intentional changes: 1. **`StartProfiling`** now receives an offset indicating the elapsed time since profiling started, as opposed to receiving an absolute/epoch-dependent profiling start time. This prevents EPs from having to do epoch conversions. Credit to @edgchen1 for the idea. 2. **`StartEvent`/`StopEvent` receive an absolute, epoch-based correlation ID (`ort_event_correlation_id`)** instead of a relative ORT event ID. The `PluginEpProfiler` bridge layer automatically converts the C++ `relative_ort_event_id` (microseconds since profiling start) to an absolute `ort_event_correlation_id` by adding the epoch-based profiling start time. This means plugin EPs can use the correlation ID directly with profiling utilities like CUPTI or ROCTracer without computing the conversion themselves. 3. **`StopEvent` now receives the completed ORT event as a parameter.** This allows EPs to optionally inspect ORT event metadata (e.g., `op_name`, `event_name`) at the time the event ends, facilitating annotation of correlated EP events. 4. **`EndProfiling` only allows EPs to *append* events (via `OrtProfilingEventsContainer`), not read or modify the full events array.** This is motivated by: - Prevent any one EP from modifying events generated by ORT or another EP. - Certain EPs (VitisAI and WebGPU) already only append events without reading the entire events array. - The CUDA EP reads the entire events array solely to merge/sort its own EP events next to correlated ORT events and add `parent_name`/`op_name` metadata. However: - Merging/sorting is mostly unnecessary since trace viewers that load these files do their own event sorting. - This merging/sorting step was previously required to augment CUDA EP events with metadata from the correlated ORT event. However, that can now be obtained more simply via the new `StopEvent` parameter that provides the EP with the full correlated ORT event. - The [merge algorithm used by CUDA EP](https://github.com/microsoft/onnxruntime/blob/faad20f9d3264c7f3b6d4e4398990e13ee864512/include/onnxruntime/core/common/gpu_profiler_common.h#L391-L397) **incorrectly** assumes ORT events are sorted by non-decreasing *start* time, but they are actually sorted by [non-decreasing *end* time](https://github.com/microsoft/onnxruntime/blob/faad20f9d3264c7f3b6d4e4398990e13ee864512/onnxruntime/core/common/profiler.cc#L91) (also see #13706 (comment)). Fixing this would require sorting the entire Events array before asking a provider-bridge EP to merge in its events into the global events array. Not sure this is worth the runtime cost. #### Naming conventions for ORT event IDs - **C++ `EpProfiler` interface** (existing): Uses `relative_ort_event_id` — a timestamp offset in microseconds relative to profiling start. - **C API `OrtEpProfilerImpl`** (new in this PR): Uses `ort_event_correlation_id` — an absolute, epoch-based timestamp in microseconds computed from `std::chrono::high_resolution_clock` (platform-defined epoch). Unique across concurrent profiling sessions within the same process. - **Conversion**: The `PluginEpProfiler` bridge class (in `ep_event_profiling.cc`) performs `ort_event_correlation_id = relative_ort_event_id + profiling_start_time_epoch_us_`, mirroring the pattern in `GPUTracerManager::PushCorrelation`. ### New C APIs | API | Description | |-----|-------------| | `CreateProfilingEvent` | Create a profiling event with category, process/thread IDs, name, timestamp, duration, and key-value args | | `ReleaseProfilingEvent` | Release a profiling event | | `ProfilingEvent_GetCategory` | Get event category (`SESSION`, `NODE`, `KERNEL`, `API`) | | `ProfilingEvent_GetName` | Get event name | | `ProfilingEvent_GetTimestampUs` | Get event start timestamp (µs) | | `ProfilingEvent_GetDurationUs` | Get event duration (µs) | | `ProfilingEvent_GetArgValue` | Get an event argument value by key | | `ProfilingEventsContainer_AddEvents` | Append an array of EP events to the output container | | `OrtEp::CreateProfiler` | Returns an instance of the EP's profiler implementation | | `OrtEpProfilerImpl::StartProfiling` | Called by ORT to start a profiling session. Receives elapsed time offset (ns) since ORT profiling started | | `OrtEpProfilerImpl::StartEvent` | Called by ORT to notify that an ORT event has started. Receives an absolute `ort_event_correlation_id` | | `OrtEpProfilerImpl::StopEvent` | Called by ORT to notify that an ORT event has ended. Receives the same `ort_event_correlation_id` and ORT event metadata | | `OrtEpProfilerImpl::EndProfiling` | Called by ORT to end the profiling session and collect EP events into the output container | | `OrtEpProfilerImpl::Release` | Release the profiler instance | ### New C++ wrapper classes | Class | Description | |-------|-------------| | `Ort::ConstProfilingEvent` | Non-owning const wrapper for reading fields from an `OrtProfilingEvent` (e.g., in `StopEvent`) | | `Ort::ProfilingEvent` | Owning wrapper that creates and manages an `OrtProfilingEvent` (e.g., for `EndProfiling`) | | `Ort::UnownedProfilingEventsContainer` | Non-owning wrapper for adding events to an `OrtProfilingEventsContainer` during `EndProfiling` | ### Example EP profiling implementation This PR updates an example plugin EP to use the new profiling APIs: - Plugin EP code: [test/autoep/library/example_plugin_ep_kernel_registry](https://github.com/microsoft/onnxruntime/tree/adrianl/PluginEp_ProfilingApis/onnxruntime/test/autoep/library/example_plugin_ep_kernel_registry) - `OrtEpProfilerImpl` implementation: [ep_profiling.h](https://github.com/microsoft/onnxruntime/blob/adrianl/PluginEp_ProfilingApis/onnxruntime/test/autoep/library/example_plugin_ep_kernel_registry/ep_profiling.h) / [ep_profiling.cc](https://github.com/microsoft/onnxruntime/blob/adrianl/PluginEp_ProfilingApis/onnxruntime/test/autoep/library/example_plugin_ep_kernel_registry/ep_profiling.cc) - `OrtEp::CreateProfiler()` implementation: [ep.cc](https://github.com/microsoft/onnxruntime/blob/adrianl/PluginEp_ProfilingApis/onnxruntime/test/autoep/library/example_plugin_ep_kernel_registry/ep.cc) ### Existing bugs found Not fixed in this PR. - The [merge algorithm used by CUDA EP](https://github.com/microsoft/onnxruntime/blob/faad20f9d3264c7f3b6d4e4398990e13ee864512/include/onnxruntime/core/common/gpu_profiler_common.h#L391-L397) **incorrectly** assumes ORT events are sorted by non-decreasing *start* time, but they are actually sorted by [non-decreasing *end* time](https://github.com/microsoft/onnxruntime/blob/faad20f9d3264c7f3b6d4e4398990e13ee864512/onnxruntime/core/common/profiler.cc#L91) (also see #13706 (comment)). - Run profilers do not handle subgraphs (e.g., subgraph of a control-flow operator). Has been the case since run profilers were [introduced](#26846). ### Motivation and Context Allows plugin EPs to generate profiling events, further closing the functionality gap between provider-bridge EPs and plugin EPs. --------- Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>

Abhishek Udupa added 14 commits November 16, 2022 00:37

WIP

022daef

WIP

e04b80f

Fixes

d8f6c63

Complete implementation of profiler

b28c38a

Refactor code common to GPU profilers into separate CUs

b62edc2

Fixes

9e6c356

Fixes

d1783f3

Fix typo

779f78e

Add guards

be8bb8d

Fixes

fa85c1f

Reimplement using CRTP/inline classes to avoid linker challenges

0972809

Minor fixes. Works fine on CUDA

2b48522

Fixes for ROCm

760d589

Ran clang-format on all changed source files

a771163

abudup requested review from RandySheriffH, cloudhan and zhangyaobit November 19, 2022 01:15

snnn reviewed Nov 19, 2022

View reviewed changes

onnxruntime/core/providers/cuda/cupti_manager.cc Outdated Show resolved Hide resolved

Abhishek Udupa added 3 commits November 19, 2022 02:14

Fix memory leak

4d5542d

Fix

2253df5

semicolon

67479d5