
Conversation

@gedoensmax
Contributor

This PR adds a test for CiG inference to demonstrate what usage for it should look like.
It is important not to call cudaSetDevice in that flow since it will create a new context. @nieubank I am not sure why there was a cudaSetDevice on each import call 🤔 Is this done to enable importing semaphores of e.g. GPU:1 into a session running on GPU:0? Context management is unreliable with the current mixing of the CUDA runtime and CUDA driver APIs.

@nieubank
Contributor

nieubank commented Jan 8, 2026

> This PR adds a test for CiG inference to demonstrate what usage for it should look like. It is important not to call cudaSetDevice in that flow since it will create a new context. @nieubank I am not sure why there was a cudaSetDevice on each import call 🤔 Is this done to enable importing semaphores of e.g. GPU:1 into a session running on GPU:0? Context management is unreliable with the current mixing of the CUDA runtime and CUDA driver APIs.

Awesome, thanks for this! You're seeing my inexperience with the CUDA API here :). I have another branch I was working on to fix some of the context handling, but I figure this implementation will be a longer-term collaboration/hand-off at some point. I just wanted to validate the API with some real code.

@gedoensmax
Contributor Author

Yes, sure, we (or in other words @praneshgo) will probably take it over. I made these changes for exactly the same reason, to experiment with it, and I already identified a TRT RTX optimization opportunity that we will fix internally.

By the way, to better test the correct async behaviour it might be better to submit multiple inferences and wait only on the last result, to ensure that we are not effectively synchronous due to CPU overhead.
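A minimal sketch of that pattern, using the C++ API wrapper (the test itself mixes the C and C++ APIs; session, inputs, and the stream handle here are illustrative, not the test's actual code):

// Sketch only: enqueue several runs on the same stream and synchronize once at the end.
constexpr int kNumRuns = 8;
std::vector<Ort::Value> last_outputs;
for (int i = 0; i < kNumRuns; ++i) {
  last_outputs = session.Run(run_options, input_names.data(), inputs.data(), inputs.size(),
                             output_names.data(), output_names.size());
}
// Only wait on the GPU work once, after the last submission. If CPU overhead were
// making the runs effectively synchronous, the wall time would scale with kNumRuns
// instead of the runs overlapping on the stream.
ASSERT_EQ(cudaStreamSynchronize(stream), cudaSuccess);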

Contributor

Copilot AI left a comment

Pull request overview

This PR adds a comprehensive test for CUDA in Graphics (CIG) inference to demonstrate proper usage patterns when working with external D3D12 resources. The key change is modifying context management to avoid calling cudaSetDevice when a CUDA context already exists, which prevents creating unwanted new contexts during CIG workflows.

Changes:

  • Added FullInferenceWithExternalMemoryCIG test demonstrating CIG context usage with external memory import
  • Modified context management in nv_provider_factory.cc to check for existing CUDA contexts before calling cudaSetDevice
  • Migrated test API calls from ort_api_ to ort_interop_api_ for external resource operations

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 8 comments.

File: onnxruntime/test/providers/nv_tensorrt_rtx/nv_external_resource_importer_test.cc
Description: Added CudaDriverLoader helper class, renamed the test fixture to NvExecutionProviderExternalResourceImporterTest, migrated to the interop API, and added a comprehensive CIG inference test

File: onnxruntime/core/providers/nv_tensorrt_rtx/nv_provider_factory.cc
Description: Modified ImportMemory, ImportSemaphore, and CreateSyncStreamForDevice to check for an existing CUDA context before calling cudaSetDevice


onnxruntime::nv::provider_option_names::kUserComputeStream,
onnxruntime::nv::provider_option_names::kHasUserComputeStream,
onnxruntime::nv::provider_option_names::kMaxSharedMemSize,
// TRT will create it's own context to create streams if we do not manually provide aux streams
Copilot AI Jan 14, 2026

The comment has a typo: "it's own context" should be "its own context" (possessive pronoun doesn't use an apostrophe).

Suggested change
// TRT will create it's own context to create streams if we do not manually provide aux streams
// TRT will create its own context to create streams if we do not manually provide aux streams

if (CUDA_SUCCESS != cuCtxGetCurrent(&cu_context)) {
return impl.ort_api.CreateStatus(ORT_EP_FAIL, "cuCtxGetCurrent failed");
}
if (!cu_context) { // do not alter currently set context
Copilot AI Jan 14, 2026

The error handling for cuCtxGetCurrent is correct: a failure returns an error, and a null cu_context after a successful call is a valid state. Consider adding a comment to clarify that a null context after a successful cuCtxGetCurrent is expected when no context is currently set on the thread.

Suggested change
if (!cu_context) { // do not alter currently set context
if (!cu_context) {
// cuCtxGetCurrent succeeded but returned nullptr, which indicates that no CUDA context
// is currently set for this thread. This is a valid state; in this case we explicitly
// reset and set the device for this EP without modifying any existing non-null context.

cuCtxSetCurrent_fn = reinterpret_cast<cuCtxSetCurrent_t>(
GetProcAddress(cuda_driver_dll_, "cuCtxSetCurrent"));
cuCtxGetCurrent_fn = reinterpret_cast<cuCtxGetCurrent_t>(
GetProcAddress(cuda_driver_dll_, "cuCtxGetCurrent"));
Copilot AI Jan 14, 2026

Inconsistent indentation: line 80 uses one space for indentation while the surrounding lines (76, 78, 82) use four spaces. This should be aligned consistently with the other GetProcAddress calls.

Suggested change
GetProcAddress(cuda_driver_dll_, "cuCtxGetCurrent"));
GetProcAddress(cuda_driver_dll_, "cuCtxGetCurrent"));

ASSERT_EQ(cudaGetDeviceCount(&cuda_device_count), cudaSuccess);
for (; cuda_device_id < cuda_device_count; ++cuda_device_id) {
cudaDeviceProp prop;
unsigned int node_mask = 0;
Copilot AI Jan 14, 2026

The variable 'node_mask' is declared but never used. Consider removing it or adding a comment explaining why it's present.

Suggested change
unsigned int node_mask = 0;

Comment on lines 81 to 82
cuCtxPushCurrent_fn = reinterpret_cast<cuCtxPushCurrent_t>(
GetProcAddress(cuda_driver_dll_, "cuCtxPushCurrent"));
Copilot AI Jan 14, 2026

The CudaDriverLoader class loads the cuCtxPushCurrent function pointer but it is never used anywhere in the test. Consider removing it if it's not needed, or add a comment explaining why it's loaded but not used.

}
}
ASSERT_TRUE(found);
// Global instance of CUDA driver function loader
Copilot AI Jan 14, 2026

The comment "Global instance of CUDA driver function loader" is misleading since cuda_driver_loader is a local variable, not a global. Consider removing or updating this comment to accurately reflect the variable's scope.

Suggested change
// Global instance of CUDA driver function loader
// Create CUDA driver function loader instance

Comment on lines 940 to 976
if (!IsEPAvailable()) {
GTEST_SKIP() << "NvTensorRtRtx EP not available";
}

// Create a simple ReLU model using shared utility pattern
PathString model_path = ORT_TSTR("external_mem_relu_test.onnx");
{
onnxruntime::Model model("relu_test", false, DefaultLoggingManager().DefaultLogger());
auto& graph = model.MainGraph();

ONNX_NAMESPACE::TypeProto tensor_type;
tensor_type.mutable_tensor_type()->set_elem_type(ONNX_NAMESPACE::TensorProto_DataType_FLOAT);
tensor_type.mutable_tensor_type()->mutable_shape()->add_dim()->set_dim_value(1);
tensor_type.mutable_tensor_type()->mutable_shape()->add_dim()->set_dim_value(3);
tensor_type.mutable_tensor_type()->mutable_shape()->add_dim()->set_dim_value(64);
tensor_type.mutable_tensor_type()->mutable_shape()->add_dim()->set_dim_value(64);

auto& input_arg = graph.GetOrCreateNodeArg("X", &tensor_type);
auto& output_arg = graph.GetOrCreateNodeArg("Y", &tensor_type);
graph.AddNode("relu", "Relu", "ReLU operation", {&input_arg}, {&output_arg});

ASSERT_STATUS_OK(graph.Resolve());
ASSERT_STATUS_OK(onnxruntime::Model::Save(model, model_path));
}

const int64_t batch = 1, channels = 3, dim = 64;
const int64_t shape[] = {batch, channels, dim, dim};
const size_t num_elements = batch * channels * dim * dim;
const size_t buffer_size = num_elements * sizeof(float);

// Create external resource importer
OrtExternalResourceImporter* importer = nullptr;
OrtStatus* status = ort_interop_api_->CreateExternalResourceImporterForDevice(ep_device_, &importer);
if (status != nullptr || importer == nullptr) {
if (status != nullptr) ort_api_->ReleaseStatus(status);
clearFileIfExists(model_path);
GTEST_SKIP() << "External resource import not supported";
Copilot AI Jan 14, 2026

The CIG context created at line 937 will be leaked if the test is skipped at lines 941 or 976 (or if any ASSERT fails). The context is never destroyed in the cleanup section either. This causes a resource leak. Consider using an RAII wrapper to ensure the context is destroyed in all code paths, including early exits from GTEST_SKIP or failed assertions.
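A minimal sketch of such a guard, assuming the CIG context is a driver-API CUcontext that should be destroyed with cuCtxDestroy (the variable names are illustrative, not the test's actual code):

// Sketch only: RAII wrapper so the context is destroyed on every exit path,
// including GTEST_SKIP and failed ASSERTs.
struct ScopedCuContext {
  CUcontext ctx = nullptr;
  explicit ScopedCuContext(CUcontext c) : ctx(c) {}
  ~ScopedCuContext() {
    if (ctx != nullptr) {
      cuCtxDestroy(ctx);  // ignore the return value during teardown
    }
  }
  ScopedCuContext(const ScopedCuContext&) = delete;
  ScopedCuContext& operator=(const ScopedCuContext&) = delete;
};

// Usage in the test body, right after creating the CIG context:
// ScopedCuContext context_guard(cig_context);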

if (status != nullptr || importer == nullptr) {
if (status != nullptr) ort_api_->ReleaseStatus(status);
GTEST_SKIP() << "External resource import not supported";
GTEST_SKIP() << "External Onv_exeresource import not supported";
Copilot AI Jan 14, 2026

The error message contains a typo: "External Onv_exeresource import not supported" should be "External resource import not supported".

Suggested change
GTEST_SKIP() << "External Onv_exeresource import not supported";
GTEST_SKIP() << "External resource import not supported";

@praneshgo
Contributor

Functionally, the change looks good to me.

@praneshgo
Contributor

@gaugarg-nv @ankan-ban can you please review this as well? Thanks.

@ankan-ban
Contributor

ankan-ban commented Jan 15, 2026

Thanks, Max, for writing the test. It looks good. It's nice to see most of the interop functionality cleanly abstracted behind the new ORT interop APIs.

There are just a couple of things in the test that are still NVIDIA-specific (maybe consider abstracting these out too in the future with further additions to the ORT APIs):

  1. Use of CUDA APIs by the app to create the context before invoking ORT (this requires a command queue handle that would share the same TSG as the CUDA context; sharing the same TSG is a requirement for enabling CiG on our hardware).

  2. Use of NV-specific session options (kUserComputeStream, kHasUserComputeStream, kMaxSharedMemSize). Maybe having a mechanism for passing the generic ORT stream object to session.run() makes even more sense. The shared memory size limit is again required for running in CiG mode, and hopefully, if we have a generic way of doing 1 above, it can allow the EP to automatically set the correct value depending on the GPU.

I think resolving the above two would allow app developers to write truly IHV-agnostic code that runs everywhere (e.g., using DX12 APIs for allocating resources and doing any pre/post processing, and generic ORT APIs to run the model).
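For reference, point 1 is about creating the CUDA context from the app's D3D12 command queue so that both share the same TSG. A rough sketch of what that can look like, assuming the CUDA 12.4+ driver entry point cuCtxCreate_v4 with CUctxCigParam (verify the exact struct and enum names against the CUDA headers you build with; the queue variable is illustrative):

// Sketch only: create a CiG-capable CUDA context that shares scheduling with the
// app's D3D12 command queue. API/struct names assume CUDA 12.4+ driver headers.
CUdevice device = 0;
CU_CALL_THROW(cuDeviceGet(&device, 0));

CUctxCigParam cig_param = {};
cig_param.sharedDataType = CU_CIG_DATA_TYPE_D3D12_COMMAND_QUEUE;
cig_param.sharedData = d3d12_command_queue;  // ID3D12CommandQueue* owned by the app

CUctxCreateParams create_params = {};
create_params.cigParams = &cig_param;

CUcontext cig_context = nullptr;
CU_CALL_THROW(cuCtxCreate_v4(&cig_context, &create_params, 0, device));
// The new context is current on this thread; the app and EP should avoid calling
// cudaSetDevice afterwards, since that could attach a different context.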

@gedoensmax
Contributor Author

Thanks @ankan-ban for the review.

  1. Do you consider this a big blocker? I thought of this small driver API usage as OK for an ISV to integrate, let me know if you think differently.

  2. Fully agree! I would love to set kMaxSharedMemSize implicitly, but I did not find a way to check the maximum supported shared memory size based on the currently pushed context. Having a stream as input on Ort::Session::Run would be great; @skottmckay has mentioned that on another thread, can you let us know what the status on this is?
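For context, one can query attributes of the device that owns the currently pushed context, though this returns the regular per-block shared memory limits and does not necessarily reflect the CiG-mode carve-out mentioned above (a sketch, assuming the CUDA driver API):

// Sketch only: find the device behind the current context and query its
// shared-memory attributes. This reports standard device limits, not any
// CiG-specific restriction.
CUdevice dev = 0;
CU_CALL_THROW(cuCtxGetDevice(&dev));

int smem_per_block_optin = 0;
CU_CALL_THROW(cuDeviceGetAttribute(&smem_per_block_optin,
                                   CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_BLOCK_OPTIN,
                                   dev));
// smem_per_block_optin is in bytes; whether it can be used to derive
// kMaxSharedMemSize for CiG sessions is exactly the open question above.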

@nieubank can you help tag this for 1.24, since we would like to make sure that this goes in with the newly added Interop API? I can take care of rebasing to main and accepting some of the Copilot comments if that's all that is needed.

@nieubank nieubank added this to the 1.24.0 milestone Jan 15, 2026
@gedoensmax gedoensmax force-pushed the maximilianm/nv_ext_importer branch from ca44c43 to 79d020e Compare January 16, 2026 23:33
@gedoensmax gedoensmax changed the base branch from nieubank/nv_ext_importer to main January 16, 2026 23:33
@gedoensmax gedoensmax changed the title Add test for CIG inference [TRT RTX EP] Add support for D3D12 external resourrce import Jan 16, 2026
@gedoensmax gedoensmax requested a review from Copilot January 16, 2026 23:34
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.



# Licensed under the MIT License.
find_package(CUDAToolkit REQUIRED 12.8)
if(onnxruntime_CUDA_HOME)
file(TO_CMAKE_PATH CUDAToolkit_ROOT ${onnxruntime_CUDA_HOME})
Copilot AI Jan 16, 2026

The arguments to file(TO_CMAKE_PATH ...) are reversed. The correct order should be: file(TO_CMAKE_PATH ${onnxruntime_CUDA_HOME} CUDAToolkit_ROOT) where the input path comes first, followed by the output variable.

Suggested change
file(TO_CMAKE_PATH CUDAToolkit_ROOT ${onnxruntime_CUDA_HOME})
file(TO_CMAKE_PATH ${onnxruntime_CUDA_HOME} CUDAToolkit_ROOT)

@gedoensmax gedoensmax changed the title [TRT RTX EP] Add support for D3D12 external resourrce import [TRT RTX EP] Add support for D3D12 external resource import Jan 16, 2026
@skottmckay
Contributor

#26988 added the stream in RunOptions and is now merged.

@ankan-ban
Contributor

> Do you consider this a big blocker? I thought of this small driver API usage as OK for an ISV to integrate, let me know if you think differently.

Agree that it's not a big blocker, but after #26988 it seems to be the only NV-specific thing that the app needs to do. Agree that it's likely not too much for ISVs to integrate.

@gedoensmax gedoensmax force-pushed the maximilianm/nv_ext_importer branch from 3d012ef to 468eff1 Compare January 19, 2026 15:02
@gedoensmax
Contributor Author

I rebased on the refined structs and started providing the stream as a run option. There are still missing changes needed to support this for CiG, but we are tracking that internally.

// Run inference. ORT submits all work to the stream before returning, so we signal the async semaphore below.
Ort::RunOptions run_options;
run_options.SetSyncStream(ort_stream);
run_options.AddConfigEntry(kOrtRunOptionsConfigDisableSynchronizeExecutionProviders, "1");
Contributor

Is this setting required? Not sure that everything is fully thought through if someone sets this flag.

e.g. if set, we do not call Stream::CleanupOnRunEnd here, and that (afaics) is the only place CudaStream frees deferred_cpu_buffers_. ~CudaStream ignores them, and we're storing them as void* rather than a unique_ptr that could auto-delete, so that's potentially a leak.

We also don't call Stream::Flush if it is set, which skips the cudaStreamSynchronize in CudaStream::Flush.

Contributor Author

It is required to get all the benefits of DX interop. We synchronize with a semaphore so that we don't wait on the CPU but instead have asynchronous synchronization between CUDA and DX on the GPU.

If this setting cannot be used, there is not really much point in using semaphores.
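For context, the GPU-side handshake described here looks roughly like the following. This is a generic CUDA/D3D12 sketch with illustrative names; the test itself goes through the ORT interop API for the semaphore import:

// Sketch only: after Run() has enqueued all work on the CUDA stream, signal an
// imported semaphore on the GPU instead of blocking the CPU with a stream sync.
cudaExternalSemaphoreSignalParams signal_params = {};
signal_params.params.fence.value = fence_value;  // monotonically increasing value

// cuda_ext_semaphore was imported earlier from a shared D3D12 fence.
ASSERT_EQ(cudaSignalExternalSemaphoresAsync(&cuda_ext_semaphore, &signal_params,
                                            1, cuda_stream),
          cudaSuccess);

// On the D3D12 side, make the command queue wait for the same fence value before
// consuming the inference outputs; no CPU wait is involved.
d3d12_queue->Wait(d3d12_fence, fence_value);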

Contributor

This is all in the CudaStream implementation, so you control it. I was mainly pointing out that, as it currently stands, I think that implementation has issues like a memory leak of deferred_cpu_buffers_.

Are there scenarios where the inference output is going to be externally consumed on the GPU and a cudaStreamSynchronize needs to be avoided?

Contributor Author

Yes, for sure. Any Stable Diffusion inference or video processing pipeline should always be synchronized externally. We wrote a devblog on this a few years back: https://developer.nvidia.com/blog/end-to-end-ai-for-nvidia-based-pcs-cuda-and-tensorrt-execution-providers-in-onnx-runtime/#sample_application

@@ -0,0 +1,1190 @@
// Copyright (c) Microsoft Corporation. All rights reserved.

Check warning

Code scanning / lintrunner

CLANGFORMAT/format Warning test

See https://clang.llvm.org/docs/ClangFormat.html.
Run lintrunner -a to apply this patch.
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 10 comments.



Comment on lines +780 to +805
// Validate tensor offset does not exceed available buffer size
size_t available_size = handle->descriptor.size_bytes - handle->descriptor.offset_bytes;
if (tensor_desc->offset_bytes > available_size) {
return impl.ort_api.CreateStatus(ORT_INVALID_ARGUMENT,
"tensor offset_bytes exceeds available imported memory size");
}

// Calculate the data pointer with tensor offset
void* data_ptr = reinterpret_cast<void*>(handle->mapped_ptr + tensor_desc->offset_bytes);

// Get memory info from the EP device (the importer is associated with the OrtEpDevice)
const OrtMemoryInfo* memory_info = impl.ep_device_->device_memory_info;

// Create tensor that references the imported memory. The tensor does not own the memory -
// the user manages the lifetime of both the OrtValue and OrtExternalMemoryHandle.
// The user must keep the handle alive while the tensor is in use.
// No deleter is needed since this is for inference inputs/outputs where the user controls lifetime.
OrtStatus* status = impl.ort_api.CreateTensorWithDataAsOrtValue(
memory_info,
data_ptr,
handle->descriptor.size_bytes - tensor_desc->offset_bytes,
tensor_desc->shape,
tensor_desc->rank,
tensor_desc->element_type,
out_tensor);

Copilot AI Jan 22, 2026

CreateTensorFromMemoryImpl computes available_size as handle->descriptor.size_bytes - handle->descriptor.offset_bytes, but then passes handle->descriptor.size_bytes - tensor_desc->offset_bytes as the backing buffer size. This ignores the imported memory's offset_bytes (and can also exceed the mapped range), which can lead to tensors that overrun the mapped external memory. Use the available mapped size (i.e., available_size - tensor_desc->offset_bytes) when calling CreateTensorWithDataAsOrtValue.
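In other words, the fix is roughly (a sketch against the quoted snippet, reusing its variable names):

// Sketch: clamp the backing-buffer size to the mapped range that remains after
// both the import offset and the tensor offset.
size_t tensor_buffer_size = available_size - tensor_desc->offset_bytes;
OrtStatus* status = impl.ort_api.CreateTensorWithDataAsOrtValue(
    memory_info, data_ptr, tensor_buffer_size,
    tensor_desc->shape, tensor_desc->rank, tensor_desc->element_type, out_tensor);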

Comment on lines +707 to +715
CUcontext cu_context = 0;
CU_CALL_THROW(cuCtxGetCurrent(&cu_context));
if (!cu_context) {
// cuCtxGetCurrent succeeded but returned nullptr, which indicates that no CUDA context
// is currently set for this thread. This implies that there is no user-created context.
// We use runtime API to initialize a context for the specified device.
CUDA_CALL_THROW(cudaSetDevice(device_id));
CU_CALL_THROW(cuCtxGetCurrent(&cu_context));
}
Copilot AI Jan 22, 2026

This ScopedContext helper calls cudaSetDevice(device_id) when there is no current CUDA context, which will create/attach a CUDA runtime context. The PR description explicitly calls out that cudaSetDevice must not be used in the external import/CIG flow because it can create a new context and break driver/runtime context management. Consider switching to a pure-driver approach (e.g., cuDevicePrimaryCtxRetain + cuCtxPushCurrent, or requiring an existing context) instead of invoking the runtime API here.
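A minimal sketch of the driver-only alternative mentioned here (primary-context retain instead of cudaSetDevice; error handling reduced for brevity, and the release side is left to the caller):

// Sketch only: attach the device's primary context without touching the runtime API.
CUcontext cu_context = nullptr;
CU_CALL_THROW(cuCtxGetCurrent(&cu_context));
if (!cu_context) {
  CUdevice device = 0;
  CU_CALL_THROW(cuDeviceGet(&device, device_id));
  // Retain the primary context and make it current; pair with
  // cuDevicePrimaryCtxRelease(device) when the EP no longer needs it.
  CU_CALL_THROW(cuDevicePrimaryCtxRetain(&cu_context, device));
  CU_CALL_THROW(cuCtxSetCurrent(cu_context));
}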

Comment on lines 746 to 756
// The handle has a Release callback that does the actual cleanup
// This method is called from OrtExternalResourceImporterImpl::ReleaseMemory
// The Release callback in the handle will call the static ReleaseCallback
auto* mem_handle = static_cast<NvTrtRtxExternalMemoryHandle*>(handle);

// Destroy the external memory object (also releases mapped buffer)
if (mem_handle->ext_memory != nullptr) {
cuDestroyExternalMemory(mem_handle->ext_memory);
}

delete mem_handle;
Copilot AI Jan 22, 2026

The interop API releases imported handles via handle->Release(handle) (see core/session/interop_api.cc), but NvTrtRtxExternalResourceImporterImpl also provides ReleaseMemoryImpl/ReleaseSemaphoreImpl that independently destroy the CUDA objects and delete the handle. If these importer-level release callbacks are ever invoked (now or in future API plumbing), this creates a high risk of double-destruction because the handle’s own Release callback already does the same cleanup. To avoid double frees, make the importer-level Release* implementations delegate to handle->Release (or make the handle Release callback call the importer’s Release*), but don’t do both cleanups separately.

Suggested change
// The handle has a Release callback that does the actual cleanup
// This method is called from OrtExternalResourceImporterImpl::ReleaseMemory
// The Release callback in the handle will call the static ReleaseCallback
auto* mem_handle = static_cast<NvTrtRtxExternalMemoryHandle*>(handle);
// Destroy the external memory object (also releases mapped buffer)
if (mem_handle->ext_memory != nullptr) {
cuDestroyExternalMemory(mem_handle->ext_memory);
}
delete mem_handle;
// Delegate cleanup to the handle's Release callback to avoid double-destruction.
if (handle->Release != nullptr) {
handle->Release(handle);
}

@yuslepukhin
Member

@gedoensmax Please review, address, comment on, and resolve the Copilot comments. They are often useful.
