Skip to content

Make the ROCM profiler thread-safe, session-aware and preserve logical ordering between CPU and GPU events#13549

Merged
abudup merged 41 commits intomainfrom
abudup/session-aware-gpu-profilers
Nov 10, 2022
Merged

Make the ROCM profiler thread-safe, session-aware and preserve logical ordering between CPU and GPU events#13549
abudup merged 41 commits intomainfrom
abudup/session-aware-gpu-profilers

Conversation

@abudup
Copy link
Copy Markdown
Contributor

@abudup abudup commented Nov 2, 2022

Description

The existing ROCM profiler has a few shortcomings, which this PR fixes.

Motivation and Context

The existing ROCM profiler:

  1. Is not thread-safe
  2. Is not session-aware: i.e., if multiple inference sessions enable profiling, then events (esp GPU events) get mixed up between the sessions
  3. Has some issues with respect to coding standards.

This PR addresses all of the above by cleanly re-implementing parts of the ROCM profiler as required.

Attached are 4 profile outputs from a multi-session run of the StableDiffusion model, as well as a quick-and-dirty script that checks the profile outputs for the invariants claimed.

sd_profile_outputs.tar.gz

check_profile_output_wellformedness.zip

@cloudhan
Copy link
Copy Markdown
Contributor

cloudhan commented Nov 3, 2022

@mwootton As the original rocm profiler contributor, would you mind take a look at this?

@cloudhan
Copy link
Copy Markdown
Contributor

cloudhan commented Nov 4, 2022

@abudup FYI, #10911 (comment), seems there is some reasons there is the RoctracerLogger.{cc,h} and ThreadUtil.{cc,h}.

@abudup
Copy link
Copy Markdown
Contributor Author

abudup commented Nov 4, 2022

@abudup FYI, #10911 (comment), seems there is some reasons there is the RoctracerLogger.{cc,h} and ThreadUtil.{cc,h}.

@cloudhan, thanks, I read through the comment, and the rationale there for keeping the names was to maintain name compatibility with Kineto. It isn't clear to me why maintaining compatibility with Kineto is a goal for an ORT profiler, especially now, when the code has diverged quite a bit from its original implementation. Could you please elaborate on any context that I'm missing?

@cloudhan
Copy link
Copy Markdown
Contributor

cloudhan commented Nov 4, 2022

kineto the profiler library for pytorch. I don't think maintain name compatibility is the ultimate goal here. Behind it, the author might want to minimize the maintainance cost for future upgrade, that is, develop for kineto and just profit in ORT. I think the original author is accessible from teams or email, better ask him directly.

@cloudhan
Copy link
Copy Markdown
Contributor

cloudhan commented Nov 4, 2022

Anyway, deadlock-less multi-session profiling is a must have feature for ROCm, otherwise, it will cause agony for stable diffusion models.

@abudup
Copy link
Copy Markdown
Contributor Author

abudup commented Nov 4, 2022

kineto the profiler library for pytorch. I don't think maintain name compatibility is the ultimate goal here. Behind it, the author might want to minimize the maintainance cost for future upgrade, that is, develop for kineto and just profit in ORT. I think the original author is accessible from teams or email, better ask him directly.

Yes, I understand the original reason for wanting to maintain name compatibility: the source code was (almost?) an exact copy of the source code from the Kineto code base. But I've now reimplemented those bits, and the code in this PR bears little resemblance to the Kineto code. So, in effect, we've diverged from Kineto, and will not be able to pull from Kineto, going forward, if this PR is merged.

@abudup
Copy link
Copy Markdown
Contributor Author

abudup commented Nov 4, 2022

kineto the profiler library for pytorch. I don't think maintain name compatibility is the ultimate goal here. Behind it, the author might want to minimize the maintainance cost for future upgrade, that is, develop for kineto and just profit in ORT. I think the original author is accessible from teams or email, better ask him directly.

Yes, I understand the original reason for wanting to maintain name compatibility: the source code was (almost?) an exact copy of the source code from the Kineto code base. But I've now reimplemented those bits, and the code in this PR bears little resemblance to the Kineto code. So, in effect, we've diverged from Kineto, and will not be able to pull from Kineto, going forward, if this PR is merged.

@mwootton Thoughts?

@abudup abudup assigned abudup and unassigned abudup Nov 8, 2022
cloudhan
cloudhan previously approved these changes Nov 9, 2022
@abudup abudup force-pushed the abudup/session-aware-gpu-profilers branch from b3ea614 to 5474c0a Compare November 9, 2022 23:57
@abudup abudup merged commit 9954454 into main Nov 10, 2022
@abudup abudup deleted the abudup/session-aware-gpu-profilers branch November 10, 2022 18:25
abudup added a commit that referenced this pull request Nov 15, 2022
…13647)

### Description
Improve the profile explorer by enabling shape sensitivity for GPU
kernels.



### Motivation and Context
Due to problems with the ROCM profiler, it was previously challenging to
retrieve the shapes corresponding to a GPU kernel event. [PR
13546](#13549) addresses
these problems, so it's now possible to retrieve shapes from the ORT
ROCM/CUDA profilers. This PR leverages [PR
13546](#13549) to enable
shape-sensitive GPU kernel ranking.

Co-authored-by: Abhishek Udupa <abhishek.udupa@microsoft.com>
abudup added a commit that referenced this pull request Dec 9, 2022
### Description
The existing CUDA profiler is neither session-aware, nor thread-safe.
This PR ensures both.

### Motivation and Context
[PR 13549](#13549) brought
thread-safety and session-awareness to the ROCm profiler. This PR brings
the same goodness to the CUDA profiler as well.

Sample outputs of a profiling run from the StableDiffusion model (this
model was chosen because it requires orchestration of multiple sessions,
and verifies that the profilers are now indeed session-aware) on both
CUDA and ROCm EPs are attached, along with a script that checks that the
trace files generated by the profile are well-formed.

Update 11/29: Updated the profile outputs. The older profile outputs
exhibited an issue where some timestamps were wildly out of range,
leading to problems visualizing the traces. The bug has been fixed and
the profile outputs have been updated, along with an update to the check
script to ensure that timestamps are monotonically increasing.


[sd_profile_outputs_cuda.tar.gz](https://github.com/microsoft/onnxruntime/files/10118088/sd_profile_outputs_cuda.tar.gz)

[sd_profile_outputs_rocm.tar.gz](https://github.com/microsoft/onnxruntime/files/10118089/sd_profile_outputs_rocm.tar.gz)

[check_profile_output_well_formedness.zip](https://github.com/microsoft/onnxruntime/files/10118090/check_profile_output_well_formedness.zip)

Co-authored-by: Abhishek Udupa <abhishek.udupa@microsoft.com>
baijumeswani pushed a commit that referenced this pull request Dec 13, 2022
### Description
The existing CUDA profiler is neither session-aware, nor thread-safe.
This PR ensures both.

### Motivation and Context
[PR 13549](#13549) brought
thread-safety and session-awareness to the ROCm profiler. This PR brings
the same goodness to the CUDA profiler as well.

Sample outputs of a profiling run from the StableDiffusion model (this
model was chosen because it requires orchestration of multiple sessions,
and verifies that the profilers are now indeed session-aware) on both
CUDA and ROCm EPs are attached, along with a script that checks that the
trace files generated by the profile are well-formed.

Update 11/29: Updated the profile outputs. The older profile outputs
exhibited an issue where some timestamps were wildly out of range,
leading to problems visualizing the traces. The bug has been fixed and
the profile outputs have been updated, along with an update to the check
script to ensure that timestamps are monotonically increasing.


[sd_profile_outputs_cuda.tar.gz](https://github.com/microsoft/onnxruntime/files/10118088/sd_profile_outputs_cuda.tar.gz)

[sd_profile_outputs_rocm.tar.gz](https://github.com/microsoft/onnxruntime/files/10118089/sd_profile_outputs_rocm.tar.gz)

[check_profile_output_well_formedness.zip](https://github.com/microsoft/onnxruntime/files/10118090/check_profile_output_well_formedness.zip)

Co-authored-by: Abhishek Udupa <abhishek.udupa@microsoft.com>
simon-moo pushed a commit to simon-moo/onnxruntime that referenced this pull request Dec 21, 2022
…l ordering between CPU and GPU events (microsoft#13549)

### Description
The existing ROCM profiler has a few shortcomings, which this PR fixes.

### Motivation and Context
The existing ROCM profiler:
1. Is not thread-safe
2. Is not session-aware: i.e., if multiple inference sessions enable
profiling, then events (esp GPU events) get mixed up between the
sessions
3. Has some issues with respect to coding standards.

This PR addresses all of the above by cleanly re-implementing parts of
the ROCM profiler as required.

Attached are 4 profile outputs from a multi-session run of the
StableDiffusion model, as well as a quick-and-dirty script that checks
the profile outputs for the invariants claimed.


[sd_profile_outputs.tar.gz](https://github.com/microsoft/onnxruntime/files/9924608/sd_profile_outputs.tar.gz)


[check_profile_output_wellformedness.zip](https://github.com/microsoft/onnxruntime/files/9924614/check_profile_output_wellformedness.zip)

Co-authored-by: Abhishek Udupa <abhishek.udupa@microsoft.com>
simon-moo pushed a commit to simon-moo/onnxruntime that referenced this pull request Dec 21, 2022
…icrosoft#13647)

### Description
Improve the profile explorer by enabling shape sensitivity for GPU
kernels.



### Motivation and Context
Due to problems with the ROCM profiler, it was previously challenging to
retrieve the shapes corresponding to a GPU kernel event. [PR
13546](microsoft#13549) addresses
these problems, so it's now possible to retrieve shapes from the ORT
ROCM/CUDA profilers. This PR leverages [PR
13546](microsoft#13549) to enable
shape-sensitive GPU kernel ranking.

Co-authored-by: Abhishek Udupa <abhishek.udupa@microsoft.com>
simon-moo pushed a commit to simon-moo/onnxruntime that referenced this pull request Dec 26, 2022
### Description
The existing CUDA profiler is neither session-aware, nor thread-safe.
This PR ensures both.

### Motivation and Context
[PR 13549](microsoft#13549) brought
thread-safety and session-awareness to the ROCm profiler. This PR brings
the same goodness to the CUDA profiler as well.

Sample outputs of a profiling run from the StableDiffusion model (this
model was chosen because it requires orchestration of multiple sessions,
and verifies that the profilers are now indeed session-aware) on both
CUDA and ROCm EPs are attached, along with a script that checks that the
trace files generated by the profile are well-formed.

Update 11/29: Updated the profile outputs. The older profile outputs
exhibited an issue where some timestamps were wildly out of range,
leading to problems visualizing the traces. The bug has been fixed and
the profile outputs have been updated, along with an update to the check
script to ensure that timestamps are monotonically increasing.


[sd_profile_outputs_cuda.tar.gz](https://github.com/microsoft/onnxruntime/files/10118088/sd_profile_outputs_cuda.tar.gz)

[sd_profile_outputs_rocm.tar.gz](https://github.com/microsoft/onnxruntime/files/10118089/sd_profile_outputs_rocm.tar.gz)

[check_profile_output_well_formedness.zip](https://github.com/microsoft/onnxruntime/files/10118090/check_profile_output_well_formedness.zip)

Co-authored-by: Abhishek Udupa <abhishek.udupa@microsoft.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants