Skip to content

Session-aware and thread-safe CUDA profiler#13706

Merged
abudup merged 26 commits intomainfrom
abudup/session-aware-cuda-profiler
Dec 9, 2022
Merged

Session-aware and thread-safe CUDA profiler#13706
abudup merged 26 commits intomainfrom
abudup/session-aware-cuda-profiler

Conversation

@abudup
Copy link
Copy Markdown
Contributor

@abudup abudup commented Nov 19, 2022

Description

The existing CUDA profiler is neither session-aware, nor thread-safe. This PR ensures both.

Motivation and Context

PR 13549 brought thread-safety and session-awareness to the ROCm profiler. This PR brings the same goodness to the CUDA profiler as well.

Sample outputs of a profiling run from the StableDiffusion model (this model was chosen because it requires orchestration of multiple sessions, and verifies that the profilers are now indeed session-aware) on both CUDA and ROCm EPs are attached, along with a script that checks that the trace files generated by the profile are well-formed.

Update 11/29: Updated the profile outputs. The older profile outputs exhibited an issue where some timestamps were wildly out of range, leading to problems visualizing the traces. The bug has been fixed and the profile outputs have been updated, along with an update to the check script to ensure that timestamps are monotonically increasing.

sd_profile_outputs_cuda.tar.gz
sd_profile_outputs_rocm.tar.gz
check_profile_output_well_formedness.zip

@abudup abudup requested a review from cloudhan November 29, 2022 23:23
std::string parent_name;

// Tracers may not use Jan 1 1970 as an epoch for timestamps.
// So, we need to adjust the timestamp to something sensible.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure how this works actually, can you elaborate a bit?

Copy link
Copy Markdown
Contributor Author

@abudup abudup Nov 30, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. This code is essentially trying to account for the fact that the timestamps reported by tracers (specifically, the ROCm tracer) for the beginning and end of a kernel are not guaranteed to use the "standard" epoch of Jan 1 1970. I believe the ROCm tracer uses system boot time as the epoch for timestamps (but I couldn't find this documented anywhere). Now, when we subtract the profiling start time (which is the number of ticks from Jan 1 1970), we end up with a large negative value, which messes up the visualization in a trace viewer as you observed. This code essentially tries to work around this behavior by ignoring the timestamps from tracers, and filling in something that is logical.

Note that this does not affect the durations of events: those are calculated simply as duration = end_time_ns - start_time_ns. It only affects the start timestamp. I'd argue that despite the inaccuracy, the start time is closer to the actual start time than if we simply used the tracer timestamps.

@abudup abudup requested a review from cloudhan December 6, 2022 01:32
evt.args["parent_name"] = parent_name;
}
evt.ts = last_ts + increment;
increment += evt.dur;
Copy link
Copy Markdown
Contributor

@cloudhan cloudhan Dec 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't this should be increment += <curr_event_ts> - <prev_event_ts> somehow? Otherwise, it makes all children events appears to be back to back, that is a critical skew in the trace view and will give the profiler users some illusion about the event timing.

Alternatively, is it viable just substract the timestamp diff between CPU events' time epoch and GPU events' time epoch? That is, we convert to unify them directly, instead of compute and try to reconstruct?

Copy link
Copy Markdown
Contributor Author

@abudup abudup Dec 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't this should be increment += <curr_event_ts> - <prev_event_ts> somehow? Otherwise, it makes all children events appears to be back to back, that is a critical skew in the trace view and will give the profiler users some illusion about the event timing.

Thanks! This is a much neater way of reconciling the two time streams: we preserve the relative timing between kernel launches. I've implemented a similar idea, by approximating the offset of the first kernel event, and then using that offset to adjust the subsequent kernel events.

As for the alternative approach you suggested: that's challenging, primarily because I could not find documentation about the epoch used by roctracer. I'm presuming that it's system boot time, but I could be wrong.

Comment on lines +358 to +365
while (event_iter != event_end &&
(event_iter->ts < ts ||
(event_iter->ts == ts &&
(event_iter + 1) != event_end &&
(event_iter + 1)->ts == ts))) {
merged_events.emplace_back(*event_iter);
++event_iter;
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that you are doing a linear scan here, does events guaranteed to be sorted by timestamp?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my understanding of the code, events is sorted by timestamp. That said, I've only refactored existing code, which assumed that events is sorted in non-decreasing order.

cloudhan
cloudhan previously approved these changes Dec 8, 2022
@abudup abudup merged commit 83c59d2 into main Dec 9, 2022
@abudup abudup deleted the abudup/session-aware-cuda-profiler branch December 9, 2022 21:22
baijumeswani pushed a commit that referenced this pull request Dec 13, 2022
### Description
The existing CUDA profiler is neither session-aware, nor thread-safe.
This PR ensures both.

### Motivation and Context
[PR 13549](#13549) brought
thread-safety and session-awareness to the ROCm profiler. This PR brings
the same goodness to the CUDA profiler as well.

Sample outputs of a profiling run from the StableDiffusion model (this
model was chosen because it requires orchestration of multiple sessions,
and verifies that the profilers are now indeed session-aware) on both
CUDA and ROCm EPs are attached, along with a script that checks that the
trace files generated by the profile are well-formed.

Update 11/29: Updated the profile outputs. The older profile outputs
exhibited an issue where some timestamps were wildly out of range,
leading to problems visualizing the traces. The bug has been fixed and
the profile outputs have been updated, along with an update to the check
script to ensure that timestamps are monotonically increasing.


[sd_profile_outputs_cuda.tar.gz](https://github.com/microsoft/onnxruntime/files/10118088/sd_profile_outputs_cuda.tar.gz)

[sd_profile_outputs_rocm.tar.gz](https://github.com/microsoft/onnxruntime/files/10118089/sd_profile_outputs_rocm.tar.gz)

[check_profile_output_well_formedness.zip](https://github.com/microsoft/onnxruntime/files/10118090/check_profile_output_well_formedness.zip)

Co-authored-by: Abhishek Udupa <abhishek.udupa@microsoft.com>
simon-moo pushed a commit to simon-moo/onnxruntime that referenced this pull request Dec 26, 2022
### Description
The existing CUDA profiler is neither session-aware, nor thread-safe.
This PR ensures both.

### Motivation and Context
[PR 13549](microsoft#13549) brought
thread-safety and session-awareness to the ROCm profiler. This PR brings
the same goodness to the CUDA profiler as well.

Sample outputs of a profiling run from the StableDiffusion model (this
model was chosen because it requires orchestration of multiple sessions,
and verifies that the profilers are now indeed session-aware) on both
CUDA and ROCm EPs are attached, along with a script that checks that the
trace files generated by the profile are well-formed.

Update 11/29: Updated the profile outputs. The older profile outputs
exhibited an issue where some timestamps were wildly out of range,
leading to problems visualizing the traces. The bug has been fixed and
the profile outputs have been updated, along with an update to the check
script to ensure that timestamps are monotonically increasing.


[sd_profile_outputs_cuda.tar.gz](https://github.com/microsoft/onnxruntime/files/10118088/sd_profile_outputs_cuda.tar.gz)

[sd_profile_outputs_rocm.tar.gz](https://github.com/microsoft/onnxruntime/files/10118089/sd_profile_outputs_rocm.tar.gz)

[check_profile_output_well_formedness.zip](https://github.com/microsoft/onnxruntime/files/10118090/check_profile_output_well_formedness.zip)

Co-authored-by: Abhishek Udupa <abhishek.udupa@microsoft.com>
adrianlizarraga added a commit that referenced this pull request Apr 1, 2026
### Description
#### TLDR

This PR ports the existing C++
[EpProfiler](https://github.com/microsoft/onnxruntime/blob/faad20f9d3264c7f3b6d4e4398990e13ee864512/include/onnxruntime/core/framework/execution_provider.h#L359)
interfaces used by provider-bridge EPs to the binary-stable C APIs for
plugin EPs. It introduces C/C++ APIs for creating/querying profiling
events, a container for appending EP events, and callback hooks
(`StartEvent`/`StopEvent`) that give EPs access to ORT event metadata in
real-time.

#### Changes to the original C++ API

The original `EpProfiler` C++ interface was adapted for the C API with
the following intentional changes:

1. **`StartProfiling`** now receives an offset indicating the elapsed
time since profiling started, as opposed to receiving an
absolute/epoch-dependent profiling start time. This prevents EPs from
having to do epoch conversions. Credit to @edgchen1 for the idea.
2. **`StartEvent`/`StopEvent` receive an absolute, epoch-based
correlation ID (`ort_event_correlation_id`)** instead of a relative ORT
event ID. The `PluginEpProfiler` bridge layer automatically converts the
C++ `relative_ort_event_id` (microseconds since profiling start) to an
absolute `ort_event_correlation_id` by adding the epoch-based profiling
start time. This means plugin EPs can use the correlation ID directly
with profiling utilities like CUPTI or ROCTracer without computing the
conversion themselves.
3. **`StopEvent` now receives the completed ORT event as a parameter.**
This allows EPs to optionally inspect ORT event metadata (e.g.,
`op_name`, `event_name`) at the time the event ends, facilitating
annotation of correlated EP events.
4. **`EndProfiling` only allows EPs to *append* events (via
`OrtProfilingEventsContainer`), not read or modify the full events
array.** This is motivated by:
- Prevent any one EP from modifying events generated by ORT or another
EP.
- Certain EPs (VitisAI and WebGPU) already only append events without
reading the entire events array.
- The CUDA EP reads the entire events array solely to merge/sort its own
EP events next to correlated ORT events and add `parent_name`/`op_name`
metadata. However:
- Merging/sorting is mostly unnecessary since trace viewers that load
these files do their own event sorting.
- This merging/sorting step was previously required to augment CUDA EP
events with metadata from the correlated ORT event. However, that can
now be obtained more simply via the new `StopEvent` parameter that
provides the EP with the full correlated ORT event.
- The [merge algorithm used by CUDA
EP](https://github.com/microsoft/onnxruntime/blob/faad20f9d3264c7f3b6d4e4398990e13ee864512/include/onnxruntime/core/common/gpu_profiler_common.h#L391-L397)
**incorrectly** assumes ORT events are sorted by non-decreasing *start*
time, but they are actually sorted by [non-decreasing *end*
time](https://github.com/microsoft/onnxruntime/blob/faad20f9d3264c7f3b6d4e4398990e13ee864512/onnxruntime/core/common/profiler.cc#L91)
(also see
#13706 (comment)).
Fixing this would require sorting the entire Events array before asking
a provider-bridge EP to merge in its events into the global events
array. Not sure this is worth the runtime cost.

#### Naming conventions for ORT event IDs

- **C++ `EpProfiler` interface** (existing): Uses
`relative_ort_event_id` — a timestamp offset in microseconds relative to
profiling start.
- **C API `OrtEpProfilerImpl`** (new in this PR): Uses
`ort_event_correlation_id` — an absolute, epoch-based timestamp in
microseconds computed from `std::chrono::high_resolution_clock`
(platform-defined epoch). Unique across concurrent profiling sessions
within the same process.
- **Conversion**: The `PluginEpProfiler` bridge class (in
`ep_event_profiling.cc`) performs `ort_event_correlation_id =
relative_ort_event_id + profiling_start_time_epoch_us_`, mirroring the
pattern in `GPUTracerManager::PushCorrelation`.

### New C APIs

| API | Description |
|-----|-------------|
| `CreateProfilingEvent` | Create a profiling event with category,
process/thread IDs, name, timestamp, duration, and key-value args |
| `ReleaseProfilingEvent` | Release a profiling event |
| `ProfilingEvent_GetCategory` | Get event category (`SESSION`, `NODE`,
`KERNEL`, `API`) |
| `ProfilingEvent_GetName` | Get event name |
| `ProfilingEvent_GetTimestampUs` | Get event start timestamp (µs) |
| `ProfilingEvent_GetDurationUs` | Get event duration (µs) |
| `ProfilingEvent_GetArgValue` | Get an event argument value by key |
| `ProfilingEventsContainer_AddEvents` | Append an array of EP events to
the output container |
| `OrtEp::CreateProfiler` | Returns an instance of the EP's profiler
implementation |
| `OrtEpProfilerImpl::StartProfiling` | Called by ORT to start a
profiling session. Receives elapsed time offset (ns) since ORT profiling
started |
| `OrtEpProfilerImpl::StartEvent` | Called by ORT to notify that an ORT
event has started. Receives an absolute `ort_event_correlation_id` |
| `OrtEpProfilerImpl::StopEvent` | Called by ORT to notify that an ORT
event has ended. Receives the same `ort_event_correlation_id` and ORT
event metadata |
| `OrtEpProfilerImpl::EndProfiling` | Called by ORT to end the profiling
session and collect EP events into the output container |
| `OrtEpProfilerImpl::Release` | Release the profiler instance |

### New C++ wrapper classes

| Class | Description |
|-------|-------------|
| `Ort::ConstProfilingEvent` | Non-owning const wrapper for reading
fields from an `OrtProfilingEvent` (e.g., in `StopEvent`) |
| `Ort::ProfilingEvent` | Owning wrapper that creates and manages an
`OrtProfilingEvent` (e.g., for `EndProfiling`) |
| `Ort::UnownedProfilingEventsContainer` | Non-owning wrapper for adding
events to an `OrtProfilingEventsContainer` during `EndProfiling` |

### Example EP profiling implementation
This PR updates an example plugin EP to use the new profiling APIs:
- Plugin EP code:
[test/autoep/library/example_plugin_ep_kernel_registry](https://github.com/microsoft/onnxruntime/tree/adrianl/PluginEp_ProfilingApis/onnxruntime/test/autoep/library/example_plugin_ep_kernel_registry)
- `OrtEpProfilerImpl` implementation:
[ep_profiling.h](https://github.com/microsoft/onnxruntime/blob/adrianl/PluginEp_ProfilingApis/onnxruntime/test/autoep/library/example_plugin_ep_kernel_registry/ep_profiling.h)
/
[ep_profiling.cc](https://github.com/microsoft/onnxruntime/blob/adrianl/PluginEp_ProfilingApis/onnxruntime/test/autoep/library/example_plugin_ep_kernel_registry/ep_profiling.cc)
- `OrtEp::CreateProfiler()` implementation:
[ep.cc](https://github.com/microsoft/onnxruntime/blob/adrianl/PluginEp_ProfilingApis/onnxruntime/test/autoep/library/example_plugin_ep_kernel_registry/ep.cc)

### Existing bugs found
Not fixed in this PR.

- The [merge algorithm used by CUDA
EP](https://github.com/microsoft/onnxruntime/blob/faad20f9d3264c7f3b6d4e4398990e13ee864512/include/onnxruntime/core/common/gpu_profiler_common.h#L391-L397)
**incorrectly** assumes ORT events are sorted by non-decreasing *start*
time, but they are actually sorted by [non-decreasing *end*
time](https://github.com/microsoft/onnxruntime/blob/faad20f9d3264c7f3b6d4e4398990e13ee864512/onnxruntime/core/common/profiler.cc#L91)
(also see
#13706 (comment)).
- Run profilers do not handle subgraphs (e.g., subgraph of a
control-flow operator). Has been the case since run profilers were
[introduced](#26846).

### Motivation and Context
Allows plugin EPs to generate profiling events, further closing the
functionality gap between provider-bridge EPs and plugin EPs.

---------

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants