[TRT RTX EP] Add support for D3D12 external resource import #26948
Conversation
Awesome, thanks for this! You're seeing my inexperience with the CUDA API here :). I have another branch I was working on to fix some of the context stuff, but I figure this implementation will be a longer-term collaboration/hand-off at some point. Just wanted to validate the API with some real code.
Yes sure, we (or in other words @praneshgo) will probably take it over. I made these changes for the exact same reason, to experiment with it. And I already identified a TRT RTX optimization opportunity that we will fix internally. By the way, to better test the correct async behaviour it might be better to submit multiple inferences and wait on the last result, to ensure that we are not synchronous due to CPU overhead.
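For illustration, a minimal sketch of what such a test loop could look like (`session`, `binding`, `run_options`, and `cuda_stream` are placeholder names assumed to be set up elsewhere with device-resident inputs/outputs; this is not code from this PR):

```cpp
// Enqueue several inferences back-to-back and synchronize only once at the end,
// so CPU-side overhead cannot hide a silently synchronous execution path.
constexpr int kIterations = 8;
for (int i = 0; i < kIterations; ++i) {
  session.Run(run_options, binding);  // submits work to the stream, no CPU wait here
}
// Wait once, after the last submission, e.g. via cudaStreamSynchronize(cuda_stream)
// or by waiting on the semaphore signalled after the final inference.
```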
Pull request overview
This PR adds a comprehensive test for CUDA Interop Graphics (CIG) inference to demonstrate proper usage patterns when working with external D3D12 resources. The key change is modifying context management to avoid calling cudaSetDevice when a CUDA context already exists, which prevents creating unwanted new contexts during CIG workflows.
Changes:
- Added `FullInferenceWithExternalMemoryCIG` test demonstrating CIG context usage with external memory import
- Modified context management in `nv_provider_factory.cc` to check for existing CUDA contexts before calling `cudaSetDevice`
- Migrated test API calls from `ort_api_` to `ort_interop_api_` for external resource operations
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| onnxruntime/test/providers/nv_tensorrt_rtx/nv_external_resource_importer_test.cc | Added CudaDriverLoader helper class, renamed test fixture to NvExecutionProviderExternalResourceImporterTest, migrated to interop API, and added comprehensive CIG inference test |
| onnxruntime/core/providers/nv_tensorrt_rtx/nv_provider_factory.cc | Modified ImportMemory, ImportSemaphore, and CreateSyncStreamForDevice to check for existing CUDA context before calling cudaSetDevice |
```cpp
onnxruntime::nv::provider_option_names::kUserComputeStream,
onnxruntime::nv::provider_option_names::kHasUserComputeStream,
onnxruntime::nv::provider_option_names::kMaxSharedMemSize,
// TRT will create it's own context to create streams if we do not manually provide aux streams
```
Copilot AI (Jan 14, 2026)
The comment has a typo: "it's own context" should be "its own context" (possessive pronoun doesn't use an apostrophe).
```diff
- // TRT will create it's own context to create streams if we do not manually provide aux streams
+ // TRT will create its own context to create streams if we do not manually provide aux streams
```
```cpp
if (CUDA_SUCCESS != cuCtxGetCurrent(&cu_context)) {
  return impl.ort_api.CreateStatus(ORT_EP_FAIL, "cuCtxGetCurrent failed");
}
if (!cu_context) {  // do not alter currently set context
```
Copilot AI (Jan 14, 2026)
The context management logic should check the return value of cuCtxGetCurrent before dereferencing cu_context. If cuCtxGetCurrent fails (returns non-CUDA_SUCCESS), the code currently returns an error, but if it succeeds and cu_context is null, that's a valid state. However, the current implementation checks if cuCtxGetCurrent fails first, which is correct. Consider adding a comment to clarify that a null context after successful cuCtxGetCurrent is expected when no context is set.
```diff
- if (!cu_context) {  // do not alter currently set context
+ if (!cu_context) {
+   // cuCtxGetCurrent succeeded but returned nullptr, which indicates that no CUDA context
+   // is currently set for this thread. This is a valid state; in this case we explicitly
+   // reset and set the device for this EP without modifying any existing non-null context.
```
```cpp
cuCtxSetCurrent_fn = reinterpret_cast<cuCtxSetCurrent_t>(
    GetProcAddress(cuda_driver_dll_, "cuCtxSetCurrent"));
cuCtxGetCurrent_fn = reinterpret_cast<cuCtxGetCurrent_t>(
 GetProcAddress(cuda_driver_dll_, "cuCtxGetCurrent"));
```
Copilot AI (Jan 14, 2026)
Inconsistent indentation: line 80 uses one space for indentation while the surrounding lines (76, 78, 82) use four spaces. This should be aligned consistently with the other GetProcAddress calls.
```diff
- GetProcAddress(cuda_driver_dll_, "cuCtxGetCurrent"));
+    GetProcAddress(cuda_driver_dll_, "cuCtxGetCurrent"));
```
```cpp
ASSERT_EQ(cudaGetDeviceCount(&cuda_device_count), cudaSuccess);
for (; cuda_device_id < cuda_device_count; ++cuda_device_id) {
  cudaDeviceProp prop;
  unsigned int node_mask = 0;
```
Copilot AI (Jan 14, 2026)
The variable 'node_mask' is declared but never used. Consider removing it or adding a comment explaining why it's present.
```diff
- unsigned int node_mask = 0;
```
```cpp
cuCtxPushCurrent_fn = reinterpret_cast<cuCtxPushCurrent_t>(
    GetProcAddress(cuda_driver_dll_, "cuCtxPushCurrent"));
```
Copilot AI (Jan 14, 2026)
The CudaDriverLoader class loads the cuCtxPushCurrent function pointer but it is never used anywhere in the test. Consider removing it if it's not needed, or add a comment explaining why it's loaded but not used.
```cpp
  }
}
ASSERT_TRUE(found);
// Global instance of CUDA driver function loader
```
Copilot AI (Jan 14, 2026)
The comment "Global instance of CUDA driver function loader" is misleading since cuda_driver_loader is a local variable, not a global. Consider removing or updating this comment to accurately reflect the variable's scope.
```diff
- // Global instance of CUDA driver function loader
+ // Create CUDA driver function loader instance
```
```cpp
if (!IsEPAvailable()) {
  GTEST_SKIP() << "NvTensorRtRtx EP not available";
}

// Create a simple ReLU model using shared utility pattern
PathString model_path = ORT_TSTR("external_mem_relu_test.onnx");
{
  onnxruntime::Model model("relu_test", false, DefaultLoggingManager().DefaultLogger());
  auto& graph = model.MainGraph();

  ONNX_NAMESPACE::TypeProto tensor_type;
  tensor_type.mutable_tensor_type()->set_elem_type(ONNX_NAMESPACE::TensorProto_DataType_FLOAT);
  tensor_type.mutable_tensor_type()->mutable_shape()->add_dim()->set_dim_value(1);
  tensor_type.mutable_tensor_type()->mutable_shape()->add_dim()->set_dim_value(3);
  tensor_type.mutable_tensor_type()->mutable_shape()->add_dim()->set_dim_value(64);
  tensor_type.mutable_tensor_type()->mutable_shape()->add_dim()->set_dim_value(64);

  auto& input_arg = graph.GetOrCreateNodeArg("X", &tensor_type);
  auto& output_arg = graph.GetOrCreateNodeArg("Y", &tensor_type);
  graph.AddNode("relu", "Relu", "ReLU operation", {&input_arg}, {&output_arg});

  ASSERT_STATUS_OK(graph.Resolve());
  ASSERT_STATUS_OK(onnxruntime::Model::Save(model, model_path));
}

const int64_t batch = 1, channels = 3, dim = 64;
const int64_t shape[] = {batch, channels, dim, dim};
const size_t num_elements = batch * channels * dim * dim;
const size_t buffer_size = num_elements * sizeof(float);

// Create external resource importer
OrtExternalResourceImporter* importer = nullptr;
OrtStatus* status = ort_interop_api_->CreateExternalResourceImporterForDevice(ep_device_, &importer);
if (status != nullptr || importer == nullptr) {
  if (status != nullptr) ort_api_->ReleaseStatus(status);
  clearFileIfExists(model_path);
  GTEST_SKIP() << "External resource import not supported";
```
Copilot AI (Jan 14, 2026)
The CIG context created at line 937 will be leaked if the test is skipped at lines 941 or 976 (or if any ASSERT fails). The context is never destroyed in the cleanup section either. This causes a resource leak. Consider using an RAII wrapper to ensure the context is destroyed in all code paths, including early exits from GTEST_SKIP or failed assertions.
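A small RAII helper along those lines could look roughly like this (a sketch; `cuCtxDestroy_fn` stands in for a driver entry point loaded via `CudaDriverLoader` and is an assumption, not code from this PR):

```cpp
// Destroys the CIG context on every exit path, including GTEST_SKIP and failed assertions.
struct ScopedCuContext {
  CUcontext ctx = nullptr;
  ~ScopedCuContext() {
    if (ctx != nullptr) {
      cuCtxDestroy_fn(ctx);
    }
  }
};
```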
```cpp
if (status != nullptr || importer == nullptr) {
  if (status != nullptr) ort_api_->ReleaseStatus(status);
  GTEST_SKIP() << "External resource import not supported";
  GTEST_SKIP() << "External Onv_exeresource import not supported";
```
Copilot AI (Jan 14, 2026)
The error message contains a typo: "External Onv_exeresource import not supported" should be "External resource import not supported".
| GTEST_SKIP() << "External Onv_exeresource import not supported"; | |
| GTEST_SKIP() << "External resource import not supported"; |
Functionally, the change looks good to me.
@gaugarg-nv @ankan-ban can you please review this as well? Thanks.
Thanks, Max, for writing the test. It looks good. It's nice to see most of the interop functionality nicely abstracted out behind the new ORT interop APIs. There are just a couple of things that are still Nvidia-specific in the test (maybe consider abstracting these out too in the future with more additions to the ORT APIs):
I think resolving the above two would allow app developers to write truly IHV-agnostic code that runs everywhere (e.g., using DX12 APIs for allocating resources and doing any pre/post processing, and generic ORT APIs to run the model).
Thanks @ankan-ban for the review.
@nieubank, can you help tag this for 1.24? We would like to make sure that this goes in with the newly added Interop API. I can take care of rebasing to main and accepting some of the Copilot comments if that's all.
Force-pushed from ca44c43 to 79d020e
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
```cmake
# Licensed under the MIT License.
find_package(CUDAToolkit REQUIRED 12.8)
if(onnxruntime_CUDA_HOME)
  file(TO_CMAKE_PATH CUDAToolkit_ROOT ${onnxruntime_CUDA_HOME})
```
Copilot AI (Jan 16, 2026)
The arguments to file(TO_CMAKE_PATH ...) are reversed. The correct order should be: file(TO_CMAKE_PATH ${onnxruntime_CUDA_HOME} CUDAToolkit_ROOT) where the input path comes first, followed by the output variable.
```diff
- file(TO_CMAKE_PATH CUDAToolkit_ROOT ${onnxruntime_CUDA_HOME})
+ file(TO_CMAKE_PATH ${onnxruntime_CUDA_HOME} CUDAToolkit_ROOT)
```
#26988 added the stream in RunOptions and is now merged.
> Do you consider this a big blocker? I thought of this small driver API usage as OK for an ISV to integrate; let me know if you think differently.
Force-pushed from 3d012ef to 468eff1
I rebased on the refined structs and started providing the stream as a run option. There are missing changes to support this for CiG, but we are tracking this internally.
```cpp
// Run inference. ORT submits all work to the stream before returning, so we signal the async semaphore below.
Ort::RunOptions run_options;
run_options.SetSyncStream(ort_stream);
run_options.AddConfigEntry(kOrtRunOptionsConfigDisableSynchronizeExecutionProviders, "1");
```
Is this setting required? Not sure that everything is fully thought through if someone sets this flag.
e.g. if set, we do not call Stream::CleanupOnRunEnd here, and that (afaics) is the only place CudaStream frees deferred_cpu_buffers_. ~CudaStream ignores them, and we're storing them as void* rather than unique_ptr that could auto-delete, so potentially that's a leak.
We also don't call Stream::Flush if set, which has a cudaStreamSynchronize in CudaStream::Flush.
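For illustration, one way the deferred buffers could be made self-cleaning (a sketch under the assumption that the allocator's `Free` can be captured in a deleter; this is not the current `CudaStream` code, and `EnqueueDeferredCPUBuffer` is a hypothetical member name):

```cpp
// Owning storage: the deleter frees each buffer even if CleanupOnRunEnd never runs,
// so ~CudaStream no longer leaks them.
using DeferredBuffer = std::unique_ptr<void, std::function<void(void*)>>;
std::vector<DeferredBuffer> deferred_cpu_buffers_;

void EnqueueDeferredCPUBuffer(void* p, AllocatorPtr allocator) {
  deferred_cpu_buffers_.emplace_back(p, [allocator](void* ptr) { allocator->Free(ptr); });
}
```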
It is required to get all the benefits of DX interop. We are synchronizing with a semaphore so we don't wait on the CPU, and instead get async synchronization between CUDA and DX on the GPU.
If this setting cannot be used, there's not really a point in using semaphores.
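For context, the GPU-side handshake being described looks roughly like this (a sketch; `d3d_queue`, `shared_fence`, `fence_value`, `ext_sem`, and `cuda_stream` are placeholder names, not names from this PR):

```cpp
// D3D12 producer finishes its work and signals the shared fence on the GPU.
d3d_queue->Signal(shared_fence.Get(), ++fence_value);

// CUDA waits on the imported semaphore in-stream, so no CPU-side wait is needed.
cudaExternalSemaphoreWaitParams wait_params{};
wait_params.params.fence.value = fence_value;
cudaWaitExternalSemaphoresAsync(&ext_sem, &wait_params, 1, cuda_stream);

// ... enqueue the ORT inference on cuda_stream ...

// CUDA signals completion in-stream; D3D12 waits on the GPU before consuming the output,
// so no cudaStreamSynchronize is required anywhere on the CPU.
cudaExternalSemaphoreSignalParams signal_params{};
signal_params.params.fence.value = ++fence_value;
cudaSignalExternalSemaphoresAsync(&ext_sem, &signal_params, 1, cuda_stream);
d3d_queue->Wait(shared_fence.Get(), fence_value);
```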
This is all in the CudaStream implementation so you control it. Was mainly pointing out that as it currently stands I think that implementation has issues like a memory leak of deferred_cpu_buffers_.
Are there scenarios where the inference output is going to be externally consumed on GPU and a cudaStreamSynchronize needs to be avoided?
Yes for sure. Any stable diffusion inference or video processing pipeline should always be synchronized externally. We wrote a devblog on this a few years back: https://developer.nvidia.com/blog/end-to-end-ai-for-nvidia-based-pcs-cuda-and-tensorrt-execution-providers-in-onnx-runtime/#sample_application
```diff
@@ -0,0 +1,1190 @@
+// Copyright (c) Microsoft Corporation. All rights reserved.
```
Code scanning / lintrunner: CLANGFORMAT/format warning. Run `lintrunner -a` to apply this patch.
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 10 comments.
```cpp
// Validate tensor offset does not exceed available buffer size
size_t available_size = handle->descriptor.size_bytes - handle->descriptor.offset_bytes;
if (tensor_desc->offset_bytes > available_size) {
  return impl.ort_api.CreateStatus(ORT_INVALID_ARGUMENT,
                                   "tensor offset_bytes exceeds available imported memory size");
}

// Calculate the data pointer with tensor offset
void* data_ptr = reinterpret_cast<void*>(handle->mapped_ptr + tensor_desc->offset_bytes);

// Get memory info from the EP device (the importer is associated with the OrtEpDevice)
const OrtMemoryInfo* memory_info = impl.ep_device_->device_memory_info;

// Create tensor that references the imported memory. The tensor does not own the memory -
// the user manages the lifetime of both the OrtValue and OrtExternalMemoryHandle.
// The user must keep the handle alive while the tensor is in use.
// No deleter is needed since this is for inference inputs/outputs where the user controls lifetime.
OrtStatus* status = impl.ort_api.CreateTensorWithDataAsOrtValue(
    memory_info,
    data_ptr,
    handle->descriptor.size_bytes - tensor_desc->offset_bytes,
    tensor_desc->shape,
    tensor_desc->rank,
    tensor_desc->element_type,
    out_tensor);
```
Copilot AI (Jan 22, 2026)
CreateTensorFromMemoryImpl computes available_size as handle->descriptor.size_bytes - handle->descriptor.offset_bytes, but then passes handle->descriptor.size_bytes - tensor_desc->offset_bytes as the backing buffer size. This ignores the imported memory's offset_bytes (and can also exceed the mapped range), which can lead to tensors that overrun the mapped external memory. Use the available mapped size (i.e., available_size - tensor_desc->offset_bytes) when calling CreateTensorWithDataAsOrtValue.
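In other words, the suggested change would look something like this (an illustrative diff of the comment above, not the committed fix):

```diff
-    handle->descriptor.size_bytes - tensor_desc->offset_bytes,
+    available_size - tensor_desc->offset_bytes,  // stay within the mapped external memory range
```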
```cpp
CUcontext cu_context = 0;
CU_CALL_THROW(cuCtxGetCurrent(&cu_context));
if (!cu_context) {
  // cuCtxGetCurrent succeeded but returned nullptr, which indicates that no CUDA context
  // is currently set for this thread. This implicates that there is not user created context.
  // We use runtime API to initialize a context for the specified device.
  CUDA_CALL_THROW(cudaSetDevice(device_id));
  CU_CALL_THROW(cuCtxGetCurrent(&cu_context));
}
```
Copilot AI (Jan 22, 2026)
This ScopedContext helper calls cudaSetDevice(device_id) when there is no current CUDA context, which will create/attach a CUDA runtime context. The PR description explicitly calls out that cudaSetDevice must not be used in the external import/CIG flow because it can create a new context and break driver/runtime context management. Consider switching to a pure-driver approach (e.g., cuDevicePrimaryCtxRetain + cuCtxPushCurrent, or requiring an existing context) instead of invoking the runtime API here.
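A pure-driver variant of the helper could look roughly like this (a sketch of the suggestion above, reusing the `CU_CALL_THROW` macro from the surrounding code; this is not the code in this PR):

```cpp
CUcontext cu_context = nullptr;
CU_CALL_THROW(cuCtxGetCurrent(&cu_context));
if (!cu_context) {
  // Reuse the device's primary context (shared with the CUDA runtime) instead of
  // letting cudaSetDevice create/attach one behind our back.
  CUdevice cu_device = 0;
  CU_CALL_THROW(cuDeviceGet(&cu_device, device_id));
  CU_CALL_THROW(cuDevicePrimaryCtxRetain(&cu_context, cu_device));
  CU_CALL_THROW(cuCtxPushCurrent(cu_context));
  // A matching cuCtxPopCurrent/cuDevicePrimaryCtxRelease is needed on teardown.
}
```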
```cpp
// The handle has a Release callback that does the actual cleanup
// This method is called from OrtExternalResourceImporterImpl::ReleaseMemory
// The Release callback in the handle will call the static ReleaseCallback
auto* mem_handle = static_cast<NvTrtRtxExternalMemoryHandle*>(handle);

// Destroy the external memory object (also releases mapped buffer)
if (mem_handle->ext_memory != nullptr) {
  cuDestroyExternalMemory(mem_handle->ext_memory);
}

delete mem_handle;
```
Copilot AI (Jan 22, 2026)
The interop API releases imported handles via handle->Release(handle) (see core/session/interop_api.cc), but NvTrtRtxExternalResourceImporterImpl also provides ReleaseMemoryImpl/ReleaseSemaphoreImpl that independently destroy the CUDA objects and delete the handle. If these importer-level release callbacks are ever invoked (now or in future API plumbing), this creates a high risk of double-destruction because the handle’s own Release callback already does the same cleanup. To avoid double frees, make the importer-level Release* implementations delegate to handle->Release (or make the handle Release callback call the importer’s Release*), but don’t do both cleanups separately.
```diff
- // The handle has a Release callback that does the actual cleanup
- // This method is called from OrtExternalResourceImporterImpl::ReleaseMemory
- // The Release callback in the handle will call the static ReleaseCallback
- auto* mem_handle = static_cast<NvTrtRtxExternalMemoryHandle*>(handle);
-
- // Destroy the external memory object (also releases mapped buffer)
- if (mem_handle->ext_memory != nullptr) {
-   cuDestroyExternalMemory(mem_handle->ext_memory);
- }
-
- delete mem_handle;
+ // Delegate cleanup to the handle's Release callback to avoid double-destruction.
+ if (handle->Release != nullptr) {
+   handle->Release(handle);
+ }
```
@gedoensmax Please review, address, comment on, and resolve the Copilot comments. They are often useful.
This PR adds a test for CiG inference to demonstrate what usage for it should look like.
It is important to not call `cudaSetDevice` in that flow since it will create a new context. @nieubank, I am not sure why there was a `cudaSetDevice` on each import call 🤔 Is this done to enable importing semaphores of e.g. GPU:1 into a session running on GPU:0? Context management is unreliable with the current mixing of the CUDA runtime and CUDA driver APIs.