Conversation
@mwootton As the original ROCm profiler contributor, would you mind taking a look at this?
@abudup FYI, see #10911 (comment): it seems there are some reasons why RoctracerLogger.{cc,h} and ThreadUtil.{cc,h} exist.
@cloudhan, thanks, I read through the comment, and the rationale there for keeping the names was to maintain name compatibility with Kineto. It isn't clear to me why maintaining compatibility with Kineto is a goal for an ORT profiler, especially now, when the code has diverged quite a bit from its original implementation. Could you please elaborate on any context that I'm missing?
Kineto is the profiler library for PyTorch. I don't think maintaining name compatibility is the ultimate goal here. Behind it, the author probably wanted to minimize the maintenance cost of future upgrades: that is, develop for Kineto and get the benefit in ORT for free. I think the original author is reachable via Teams or email, so it would be better to ask him directly.
Anyway, deadlock-free multi-session profiling is a must-have feature for ROCm; otherwise, it will cause agony for Stable Diffusion models.
Yes, I understand the original reason for wanting to maintain name compatibility: the source code was (almost?) an exact copy of the source code from the Kineto code base. But I've now reimplemented those bits, and the code in this PR bears little resemblance to the Kineto code. So, in effect, we've diverged from Kineto, and will not be able to pull from Kineto, going forward, if this PR is merged. |
@mwootton Thoughts? |
b3ea614 to 5474c0a
…13647)

### Description
Improve the profile explorer by enabling shape sensitivity for GPU kernels.

### Motivation and Context
Due to problems with the ROCM profiler, it was previously challenging to retrieve the shapes corresponding to a GPU kernel event. [PR 13549](#13549) addresses these problems, so it's now possible to retrieve shapes from the ORT ROCM/CUDA profilers. This PR leverages [PR 13549](#13549) to enable shape-sensitive GPU kernel ranking.

Co-authored-by: Abhishek Udupa <abhishek.udupa@microsoft.com>
### Description
The existing CUDA profiler is neither session-aware nor thread-safe. This PR ensures both.

### Motivation and Context
[PR 13549](#13549) brought thread-safety and session-awareness to the ROCm profiler. This PR brings the same goodness to the CUDA profiler as well.

Sample outputs of a profiling run from the StableDiffusion model (this model was chosen because it requires orchestration of multiple sessions, and verifies that the profilers are now indeed session-aware) on both CUDA and ROCm EPs are attached, along with a script that checks that the trace files generated by the profile are well-formed.

Update 11/29: Updated the profile outputs. The older profile outputs exhibited an issue where some timestamps were wildly out of range, leading to problems visualizing the traces. The bug has been fixed and the profile outputs have been updated, along with an update to the check script to ensure that timestamps are monotonically increasing.

[sd_profile_outputs_cuda.tar.gz](https://github.com/microsoft/onnxruntime/files/10118088/sd_profile_outputs_cuda.tar.gz)
[sd_profile_outputs_rocm.tar.gz](https://github.com/microsoft/onnxruntime/files/10118089/sd_profile_outputs_rocm.tar.gz)
[check_profile_output_well_formedness.zip](https://github.com/microsoft/onnxruntime/files/10118090/check_profile_output_well_formedness.zip)

Co-authored-by: Abhishek Udupa <abhishek.udupa@microsoft.com>
…l ordering between CPU and GPU events (microsoft#13549)

### Description
The existing ROCM profiler has a few shortcomings, which this PR fixes.

### Motivation and Context
The existing ROCM profiler:
1. Is not thread-safe
2. Is not session-aware: i.e., if multiple inference sessions enable profiling, then events (esp. GPU events) get mixed up between the sessions
3. Has some issues with respect to coding standards.

This PR addresses all of the above by cleanly re-implementing parts of the ROCM profiler as required.

Attached are 4 profile outputs from a multi-session run of the StableDiffusion model, as well as a quick-and-dirty script that checks the profile outputs for the invariants claimed.

[sd_profile_outputs.tar.gz](https://github.com/microsoft/onnxruntime/files/9924608/sd_profile_outputs.tar.gz)
[check_profile_output_wellformedness.zip](https://github.com/microsoft/onnxruntime/files/9924614/check_profile_output_wellformedness.zip)

Co-authored-by: Abhishek Udupa <abhishek.udupa@microsoft.com>
Description
The existing ROCM profiler has a few shortcomings, which this PR fixes.
Motivation and Context
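The existing ROCM profiler:
1. Is not thread-safe
2. Is not session-aware: i.e., if multiple inference sessions enable profiling, then events (esp. GPU events) get mixed up between the sessions
3. Has some issues with respect to coding standards.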
This PR addresses all of the above by cleanly re-implementing parts of the ROCM profiler as required.
Attached are 4 profile outputs from a multi-session run of the StableDiffusion model, as well as a quick-and-dirty script that checks the profile outputs for the invariants claimed.
sd_profile_outputs.tar.gz
check_profile_output_wellformedness.zip
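The attached check script is not reproduced in this thread. For readers who want a feel for this kind of check, here is a minimal sketch in Python. It assumes the profile output is a JSON array of Chrome-trace-style event objects, each carrying a numeric `ts` timestamp, and that timestamps should be non-decreasing in file order. The function names `check_profile` and `check_profile_file` and the specific checks are hypothetical illustrations, not the contents of the actual script.

```python
import json


def check_profile(events):
    """Return a list of problems found in a list of trace events.

    Assumption (not taken from the attached script): every event is a
    JSON object with a numeric "ts" field, and "ts" never decreases
    from one event to the next in file order.
    """
    problems = []
    last_ts = None
    for i, ev in enumerate(events):
        if not isinstance(ev, dict):
            problems.append(f"event {i}: not a JSON object")
            continue
        ts = ev.get("ts")
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(ts, bool) or not isinstance(ts, (int, float)):
            problems.append(f"event {i}: missing or non-numeric 'ts'")
            continue
        if last_ts is not None and ts < last_ts:
            problems.append(f"event {i}: ts {ts} goes backwards from {last_ts}")
        last_ts = ts
    return problems


def check_profile_file(path):
    """Load one profile output file and run the checks above on it."""
    with open(path) as f:
        data = json.load(f)
    if not isinstance(data, list):
        return ["top-level JSON value is not an array"]
    return check_profile(data)
```

Calling `check_profile_file` on each extracted profile output and treating an empty list as "well-formed" would be one way to use it. A real check for session-awareness would additionally need to know how events are attributed to sessions, which is not described in this thread.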