[PROTON] Migrate Proton ROCm backend from roctracer to rocprofiler-sdk#9704
Merged
Jokeren merged 60 commits intoMay 14, 2026
Replace the deprecated roctracer-based profiling backend with a new
implementation built on rocprofiler-sdk, using late-start via
rocprofiler_force_configure so no LD_PRELOAD or tool-library preloading
is required.
Key changes:
- Add RocprofSDKProfiler with a two-context architecture:
* codeObjectContext (always active): lightweight callback for
kernel_id -> name registration as code objects are loaded.
* profilingContext (on-demand): HIP runtime API callback tracing
and buffer-based kernel dispatch tracing, started in doStart()
and stopped in doStop() to match Proton's start/stop semantics.
- Eagerly call force_configure at import time on AMD
so interception hooks are installed before any HSA queues are created.
Both contexts are registered at this point, causing the SDK to install
queue hooks. Only the lightweight codeObjectContext is activated
immediately.
- Rewrite _select_backend() to infer the backend from the registered
backends dict rather than calling get_current_target(), which would
trigger HIP runtime init before force_configure can run.
- Wire up ROCTx marker tracing via libroctx64's native callback API
(roctxRegisterTracerCallback) since rocprofiler-sdk's marker service
requires its replacement roctx library, unavailable with late-start.
- Add RocprofApi dispatch layer (ExternLibRocprofiler) for runtime
dlopen/dlsym of librocprofiler-sdk.so, with optional path override
via TRITON_ROCPROFILER_SDK_LIB_PATH.
- Update CMake to discover rocprofiler-sdk headers and plumb
ROCPROFILER_SDK_INCLUDE_DIR into the build.
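
To illustrate the _select_backend() change listed above, here is a minimal sketch of choosing a profiler backend purely from the names in a registered-backends mapping, without querying the active device. The function name select_backend, the mapping keys "nvidia" and "amd", the AMD backend string, and the choice to raise on ambiguity are all assumptions made for this example; the PR's actual implementation may differ.

```python
from typing import Mapping

# Hypothetical mapping from a registered Triton backend name to a Proton
# profiler backend. The AMD value is a placeholder for illustration and is
# not necessarily the string used in the PR.
_PROFILER_FOR_BACKEND = {"nvidia": "cupti", "amd": "rocprofiler-sdk"}


def select_backend(registered: Mapping[str, object]) -> str:
    """Pick a profiler backend by inspecting registered backend names only.

    Nothing here touches a driver or asks for the current target, so no GPU
    runtime is initialized as a side effect of the lookup.
    """
    candidates = [name for name in registered if name in _PROFILER_FOR_BACKEND]
    if len(candidates) != 1:
        # Assumption: an empty or ambiguous registry is treated as an error.
        raise ValueError(
            f"Cannot infer a unique backend from {sorted(registered)}"
        )
    return _PROFILER_FOR_BACKEND[candidates[0]]
```

The point of the design is ordering: because the lookup reads only dictionary keys, it can run before force_configure without creating the HIP state that force_configure needs to precede.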
…), getKernelName Fix using shared lock instead of two lock acquis, simplified no correlation path, missing capture counting api. changes to see if nvidia CI runner works
Contributor
@ZelboK Feel free to let me know if it's ready for review!
CRobeck
reviewed
Mar 13, 2026
Contributor
Author
Feel free to review :)
Jokeren
reviewed
Mar 13, 2026
Jokeren
reviewed
Mar 17, 2026
…g tests from forceconfigure
Jokeren
reviewed
Apr 13, 2026
This reverts commit 20f4a72.
Jokeren
reviewed
May 2, 2026
antiagainst
reviewed
May 4, 2026
Contributor
Hi @ZelboK can you please address the last comment and fix the CI problem? We may want to merge the PR this week.
Does not need LD_PRELOAD. Initializes rocprofiler-sdk when the Triton profiler is imported. This is a lightweight, one-time initialization that replaces the hsa_queue_create_fn pointer in the HSA table and creates and registers our two SDK contexts. It must happen before any GPU operation creates an HSA queue, because the SDK can only intercept queues created after the replacement. Queues that already exist are plain HSA queues and are invisible to the profiler.

context 1: codeObjectContext, used to register kernel names.
context 2: profilingContext, which does the heavy work and is active only between doStart() and doStop().

Between Proton start() and end(), the SDK WriteInterceptor intercepts the dispatches issued in that window with barrier packets.

Rewrote _select_backend() to avoid calling get_current_target(), which triggers HIP runtime init before force_configure can run.

Tested locally by building rocm-systems from main with the TRITON_ROCPROFILER_SDK.. env vars.
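
The ordering constraint described above (only queues created after the table entry is replaced can be intercepted) can be shown with a toy model. This is a pure-Python stand-in, not the HSA or rocprofiler-sdk API: ToyRuntime, install_hook, and the "intercepted" flag are hypothetical names invented for this sketch.

```python
class ToyRuntime:
    """Stand-in for an API table with a replaceable queue-create entry."""

    def __init__(self):
        # The table initially points at the plain implementation.
        self.queue_create = self._plain_queue_create

    @staticmethod
    def _plain_queue_create(name):
        return {"name": name, "intercepted": False}


def install_hook(runtime):
    """Swap the table entry so queues created afterwards are wrapped."""
    original = runtime.queue_create

    def hooked_queue_create(name):
        queue = original(name)
        queue["intercepted"] = True  # a profiler could observe this queue
        return queue

    runtime.queue_create = hooked_queue_create


runtime = ToyRuntime()
early = runtime.queue_create("created-before-hook")
install_hook(runtime)
late = runtime.queue_create("created-after-hook")

print(early["intercepted"], late["intercepted"])  # False True
```

The early queue was built by the original function and is never revisited, which mirrors why the initialization has to run before any GPU work creates a queue.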