
Only add support for CUDA devices in GetSupportedDevices#4

Merged
chilo-ms merged 12 commits into main from chi/update on Apr 15, 2026

Conversation

@chilo-ms (Collaborator) commented Apr 1, 2026

This PR mainly makes the following changes:

  1. In GetSupportedDevicesImpl, check the hardware device's vendor ID before claiming support for that device.
  2. CUDA assigns contiguous ordinals to CUDA-visible NVIDIA devices. Following the plugin CUDA EP, this implementation holds a device cache and uses the CUDA device id when creating OrtMemoryInfo.
  3. Add a TensorRT builder placeholder for test scenarios.
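A minimal sketch of the vendor-ID filter described in item 1. The struct and helper names here are hypothetical stand-ins, not the actual ORT plugin API; the real code reads equivalent fields from the OrtHardwareDevice instances passed to GetSupportedDevicesImpl:

```cpp
#include <cstdint>

// Hypothetical stand-in for an ORT hardware device record; the real code
// queries vendor id and device type via the ORT C API.
struct HardwareDeviceInfo {
  uint32_t vendor_id;
  bool is_gpu;
};

constexpr uint32_t kNvidiaVendorId = 0x10DE;  // NVIDIA's PCI vendor ID

// Claim support only for NVIDIA GPUs; skip every other enumerated device.
bool IsSupportedByTensorRtEp(const HardwareDeviceInfo& device) {
  return device.is_gpu && device.vendor_id == kNvidiaVendorId;
}
```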

Comment thread src/tensorrt_provider_factory.cc Outdated
Comment thread src/tensorrt_provider_factory.cc Outdated
Comment thread src/tensorrt_provider_factory.cc Outdated
@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@yuslepukhin

// delete static_cast<TRTEpDataTransfer*>(this_ptr);

All instances of DataTransfer now leak.


Refers to: src/tensorrt_execution_provider_data_transfer.cc:114 in 41ea68c.

auto& factory = *static_cast<TensorrtExecutionProviderFactory*>(this_ptr);
*data_transfer = factory.data_transfer_impl.get();

auto data_transfer_impl = std::make_unique<TRTEpDataTransfer>(static_cast<const ApiPtrs&>(factory));

This leaks because ReleaseImpl() is now commented out.

Collaborator Author

Removed the comment and added back `delete static_cast<TRTEpDataTransfer*>(this_ptr);`.
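For reference, the restored release hook looks roughly like this. TRTEpDataTransfer is stubbed here with a live-instance counter purely to make the ownership visible; the real type lives in the EP sources:

```cpp
// Stub standing in for the real TRTEpDataTransfer, instrumented so the
// effect of the delete is observable.
struct TRTEpDataTransfer {
  static int live_count;
  TRTEpDataTransfer() { ++live_count; }
  ~TRTEpDataTransfer() { --live_count; }
};
int TRTEpDataTransfer::live_count = 0;

// Mirrors the restored ReleaseImpl: cast the opaque pointer back to the
// concrete type and delete it so instances no longer leak.
void ReleaseImpl(void* this_ptr) {
  delete static_cast<TRTEpDataTransfer*>(this_ptr);
}
```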

Comment thread src/tensorrt_provider_factory.cc Outdated
}

if (num_cuda_devices == 0) {
Ort::ThrowOnError(ort_api->Logger_LogMessage(default_logger,

This will throw a C++ exception through a C API boundary, which will immediately terminate the process.
In general, every C API implemented in C++ should guard against exceptions that could rip through the C API into a C program.
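The guard pattern being asked for is roughly the following. Status and CreateStatus are stubs standing in for OrtStatus* and ort_api->CreateStatus; the macro names are illustrative:

```cpp
#include <exception>
#include <stdexcept>
#include <string>

// Stub status type standing in for OrtStatus* from the ORT C API.
struct Status { int code; std::string msg; };
Status* CreateStatus(int code, const char* msg) { return new Status{code, msg}; }

// Guard the C boundary: translate any C++ exception into a status object
// instead of letting it unwind through the C caller.
#define API_IMPL_BEGIN try {
#define API_IMPL_END                                                 \
  } catch (const std::exception& ex) {                               \
    return CreateStatus(6 /* ORT_RUNTIME_EXCEPTION */, ex.what());   \
  } catch (...) {                                                    \
    return CreateStatus(6, "unknown exception");                     \
  }

// Example C API entry point: returns nullptr on success, a status on error.
Status* SomeCApiEntryPoint(bool fail) {
  API_IMPL_BEGIN
  if (fail) throw std::runtime_error("boom");
  return nullptr;  // success
  API_IMPL_END
}
```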

Collaborator Author

I changed it to use RETURN_IF_ERROR instead.


Should wrap every C API in try/catch macros.

Collaborator Author

addressed


// Query CUDA device count once upfront so we can validate assigned ordinals.
int cuda_device_count = 0;
cudaError_t cuda_err = cudaGetDeviceCount(&cuda_device_count);

Here and below: a cudaGetDeviceCount failure is treated as no devices. This is a rare, catastrophic failure which must be reported. In CreateEpFactories, a cudaGetDeviceCount failure returns ORT_EP_FAIL.
Plugin creation can therefore still fail on systems without a usable CUDA runtime, which conflicts with the stated PR intent of graceful enumeration behavior when CUDA devices are unavailable.
The no-device case was improved, but the error-path semantics remain inconsistent with that design intent.

Collaborator Author

I updated the code in CreateEpFactories and GetSupportedDevicesImpl; they are consistent now and only log a warning message if no CUDA devices are available.


There are two situations: 1) no devices; 2) the CUDA API fails. The latter must error out, not just log a warning.
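A minimal sketch of the distinction being asked for, with the device query stubbed out (the real code calls cudaGetDeviceCount and the names here are hypothetical):

```cpp
#include <string>

enum class EnumResult { kOk, kNoDevices, kApiError };

// Stub for cudaGetDeviceCount: returns false when the CUDA API itself fails.
bool QueryDeviceCount(int* count, bool simulate_api_failure) {
  if (simulate_api_failure) return false;
  *count = 0;  // pretend no CUDA devices are visible
  return true;
}

// No devices -> warn and report an empty list; API failure -> hard error.
EnumResult EnumerateCudaDevices(bool simulate_api_failure, std::string* log) {
  int count = 0;
  if (!QueryDeviceCount(&count, simulate_api_failure)) {
    return EnumResult::kApiError;  // must surface as an error, not a warning
  }
  if (count == 0) {
    *log = "warning: no CUDA devices available";
    return EnumResult::kNoDevices;  // graceful: claim no devices, don't fail
  }
  return EnumResult::kOk;
}
```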

Collaborator Author

addressed.

Comment thread src/tensorrt_provider_factory.cc Outdated
// CUDA uses contiguous ordinals for CUDA-visible NVIDIA devices. Build that
// mapping from the filtered hardware-device list instead of relying on the
// ORT hardware device id, which is not guaranteed to be a CUDA ordinal.
int current_device_id = cuda_device_index++;

The current code still assigns CUDA ordinals by filtered enumeration order, not from a direct CUDA ordinal provided by hardware-device metadata.

If the ORT hardware enumeration order diverges from the CUDA-visible ordinal order, the allocator/memory-info association can still mismatch.

Partially addressed (a bounds check was added at tensorrt_provider_factory.cc:169), but the deeper ordering-assumption concern remains.

Collaborator Author

Good catch!
The device ordering assumption is a concern.

To address this, I use cudaDeviceGetByPCIBusId to get the CUDA device ordinal from the PCI bus ID.
ORT currently doesn't have the PCI bus ID as device metadata on Windows, and I created a PR for it.
The plugin CUDA EP has the same ordering-assumption issue, and I addressed it as well.
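The real fix calls the CUDA runtime's cudaDeviceGetByPCIBusId(&ordinal, pci_bus_id), which resolves a PCI bus id string to a CUDA device ordinal. A stubbed sketch of the lookup, with the resolution table mocked since no CUDA runtime is assumed here:

```cpp
#include <map>
#include <string>

// Stub lookup standing in for cudaDeviceGetByPCIBusId: resolves a PCI bus id
// string (e.g. "0000:65:00.0") to a CUDA device ordinal, or -1 if the device
// is not CUDA-visible.
int GetCudaOrdinalByPciBusId(const std::string& pci_bus_id,
                             const std::map<std::string, int>& table) {
  auto it = table.find(pci_bus_id);
  return it != table.end() ? it->second : -1;
}
```

Keying on the PCI bus id removes the assumption that ORT's hardware enumeration order matches CUDA's ordinal order.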

}

// Manual init for the C++ API
Ort::InitApi(ort_api);

Ort::InitApi(ort_api) should be one of the first things to do. E.g. Ort::ThrowOnError() requires it, but at this point it has not yet been initialized.
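A stubbed illustration of the ordering requirement: helpers built on the C++ API only work after Ort::InitApi has run, so it should be the first call in the factory entry point. All types here are stand-ins, not the real ONNX Runtime headers:

```cpp
// Stand-in for the ORT C API function table.
struct OrtApi {};

namespace Ort {
const OrtApi* g_api = nullptr;

// Stand-in for Ort::InitApi: stores the API pointer for later C++ helpers.
void InitApi(const OrtApi* api) { g_api = api; }

// Stand-in for any C++ helper (e.g. ThrowOnError) that needs InitApi first.
bool ApiReady() { return g_api != nullptr; }
}  // namespace Ort

bool CreateEpFactories(const OrtApi* ort_api) {
  Ort::InitApi(ort_api);  // must run first...
  return Ort::ApiReady(); // ...so later C++ API helpers can rely on it
}
```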

Collaborator Author

changed.

return ort_api->CreateStatus(ORT_RUNTIME_EXCEPTION, err_msg.c_str());
}

try {

This C API is still not guarding against exceptions. This try/catch is too narrow.

Collaborator Author

addressed.


cuda_pinned_memory_infos[device_id] = MemoryInfoUniquePtr(mem_info, ort_api.ReleaseMemoryInfo);
}
const OrtMemoryInfo* TensorrtExecutionProviderFactory::GetMemoryInfoByOrdinal(int cuda_ordinal, bool is_pinned) {

Thread about 6-space indentation in new blocks
Reviewer concern: inconsistent block indentation vs surrounding function style.
Current state: still mildly inconsistent in helper blocks that use 4-space body indentation where nearby functions mostly use 2-space body indentation.
Examples:
tensorrt_provider_factory.cc:98
tensorrt_provider_factory.cc:99
tensorrt_provider_factory.cc:103
tensorrt_provider_factory.cc:420
tensorrt_provider_factory.cc:421

Do you run lintrunner?

ReleaseAllocator = ReleaseAllocatorImpl;

CreateDataTransfer = CreateDataTransferImpl;
IsStreamAware = IsStreamAwareImpl;

An extra whitespace-only line is still present.

};
}

OrtStatus* ORT_API_CALL TensorrtExecutionProviderFactory::GetSupportedDevicesImpl(

This function still allows exception propagation. Please review all C API entry points.

} catch (const std::exception& ex) {
// Best-effort: ReleaseEpFactory shouldn't normally throw, but guard the C boundary.
(void)ex;
} catch (...) {

This catches all exceptions but still returns success.
Risk: a teardown failure would be hidden from the caller, making troubleshooting harder.

@yuslepukhin left a comment

LGTM

5 participants