Add PCI bus ID as device metadata on Windows #28066
Conversation
tianleiwu
left a comment
Summary
Well-motivated fix that addresses a real correctness issue with CUDA device ordinal mapping when OS and CUDA enumeration orders diverge. The PCI bus ID approach is the right solution.
One high-priority concern about the fallback path producing duplicate ordinals when CUDA_VISIBLE_DEVICES restricts GPU visibility.
Positives
- PCI bus ID resolution via `cudaDeviceGetByPCIBusId` is the correct, stable way to map physical GPUs to CUDA ordinals
- Good defensive checks on metadata null, empty string, CUDA API failure, and ordinal range validation
- Clean parsing of `SPDRP_LOCATION_INFORMATION` with correct decimal-to-hex conversion
- Excellent test coverage, especially the `PciBusIdRoundTrip` format consistency test
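For context, the decimal-to-hex conversion mentioned above can be sketched roughly as follows. This is an illustrative sketch, not the PR's actual code: the function name and the exact `SPDRP_LOCATION_INFORMATION` string layout are assumptions, but the key point holds, since Windows reports decimal bus/device/function values while CUDA expects a hex-formatted `DDDD:BB:DD.F` string.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: parse a Windows SPDRP_LOCATION_INFORMATION string such
// as "PCI bus 101, device 0, function 0" (decimal) and format it into the
// hex "0000:BB:DD.F" form that cudaDeviceGetByPCIBusId expects.
std::string FormatPciBusId(const std::string& location_info) {
  unsigned int bus = 0, device = 0, function = 0;
  if (std::sscanf(location_info.c_str(), "PCI bus %u, device %u, function %u",
                  &bus, &device, &function) != 3) {
    return {};  // unrecognized format
  }
  char buf[16];
  // Domain is assumed to be 0000 (single-segment PCI), matching the PR.
  std::snprintf(buf, sizeof(buf), "0000:%02x:%02x.%x", bus, device, function);
  return buf;
}
```

Note that decimal bus 101 becomes hex `65`, so getting the base conversion wrong would silently map to the wrong physical GPU.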
Concerns
| # | Severity | Component | Issue |
|---|---|---|---|
| 1 | High | cuda_ep_factory.cc | Fallback ordinal collision when PCI bus ID present but CUDA resolution fails |
| 2 | Suggestion | device_discovery.cc | PCI domain hardcoded to 0000 |
| 3 | Nitpick | test file | Missing trailing newline |
See inline comments for details.
```cpp
// Try to resolve the CUDA ordinal from pci_bus_id metadata if available.
// This is more reliable than counter-based ordinal assignment because it is
// not affected by enumeration order, CUDA_VISIBLE_DEVICES remapping, or
// mixed-vendor GPU configurations.
int current_device_id = -1;
const OrtKeyValuePairs* metadata = factory->ort_api_.HardwareDevice_Metadata(&device);
if (metadata != nullptr) {
  const char* pci_bus_id = factory->ort_api_.GetKeyValue(metadata, "pci_bus_id");
  if (pci_bus_id != nullptr && pci_bus_id[0] != '\0') {
    int resolved_ordinal = -1;
    cudaError_t err = cudaDeviceGetByPCIBusId(&resolved_ordinal, pci_bus_id);
    if (err == cudaSuccess && resolved_ordinal >= 0 && resolved_ordinal < cuda_device_count) {
      current_device_id = resolved_ordinal;
    }
  }
}

// Fallback: if pci_bus_id was not available, use counter-based ordinal assignment.
if (current_device_id < 0) {
  current_device_id = cuda_device_index++;
}

// Validate the assigned ordinal is within the range of CUDA-visible devices.
```
High — Fallback ordinal collision with CUDA_VISIBLE_DEVICES
When pci_bus_id is present but cudaDeviceGetByPCIBusId fails (e.g., device hidden by CUDA_VISIBLE_DEVICES), the code falls back to cuda_device_index++. This counter-based ordinal can collide with a PCI-resolved ordinal from another visible device.
Scenario (`CUDA_VISIBLE_DEVICES=1`, two NVIDIA GPUs):
- `cuda_device_count` = 1 (one visible device, remapped to ordinal 0)
- GPU0 (PCI 01:00.0): `cudaDeviceGetByPCIBusId` → `cudaErrorInvalidDevice` → fallback → ordinal 0
- GPU1 (PCI 65:00.0): `cudaDeviceGetByPCIBusId` → ordinal 0 (the visible device)
- Result: both GPUs assigned ordinal 0
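The collision can be reproduced with a small CUDA-free simulation. Everything here is illustrative: `ResolveOrdinal` is a stub standing in for `cudaDeviceGetByPCIBusId` under `CUDA_VISIBLE_DEVICES=1`, and `AssignWithFallback` mimics the counter-based fallback logic under review, not the PR's literal code.

```cpp
#include <map>
#include <string>
#include <vector>

// Stub for cudaDeviceGetByPCIBusId under CUDA_VISIBLE_DEVICES=1: only the
// second physical GPU (bus 0x65) is visible, remapped to ordinal 0.
bool ResolveOrdinal(const std::string& pci_bus_id, int& ordinal) {
  static const std::map<std::string, int> visible = {{"0000:65:00.0", 0}};
  auto it = visible.find(pci_bus_id);
  if (it == visible.end()) return false;  // analogous to cudaErrorInvalidDevice
  ordinal = it->second;
  return true;
}

// Mimics the reviewed logic: fall back to a counter when resolution fails.
std::vector<int> AssignWithFallback(const std::vector<std::string>& gpus) {
  std::vector<int> ordinals;
  int counter = 0;
  for (const auto& id : gpus) {
    int ordinal = -1;
    if (!ResolveOrdinal(id, ordinal)) ordinal = counter++;  // collision source
    ordinals.push_back(ordinal);
  }
  return ordinals;
}
```

Calling `AssignWithFallback({"0000:01:00.0", "0000:65:00.0"})` yields `{0, 0}`: the hidden GPU takes ordinal 0 from the counter, and the visible GPU resolves to ordinal 0 as well.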
Suggested fix: When pci_bus_id IS present but CUDA resolution fails, skip the device (it's not CUDA-visible) rather than falling back to the counter:
```cpp
if (pci_bus_id != nullptr && pci_bus_id[0] != '\0') {
  int resolved_ordinal = -1;
  cudaError_t err = cudaDeviceGetByPCIBusId(&resolved_ordinal, pci_bus_id);
  if (err == cudaSuccess && resolved_ordinal >= 0 && resolved_ordinal < cuda_device_count) {
    current_device_id = resolved_ordinal;
  } else {
    // Device has a PCI bus ID but is not visible to CUDA
    // (e.g., filtered by CUDA_VISIBLE_DEVICES). Skip it.
    continue;
  }
}
```

```cpp
  }
}

// Generate telemetry event to log the GPU and NPU driver name and version.
```
Suggestion — PCI domain hardcoded to 0000
The domain is hardcoded to 0000 in the format string. While this is correct for the vast majority of systems, multi-segment PCI configurations (non-zero domain) exist in some enterprise server hardware. The Linux implementation extracts the actual domain from sysfs.
Consider adding a brief comment noting this limitation (e.g., // Domain is 0000 for standard single-segment PCI; multi-segment configs are not supported), or exploring DEVPKEY_Device_LocationPaths which may contain the full domain.
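If the domain were ever extracted, only the format string would need to widen, since `cudaDeviceGetByPCIBusId` accepts the extended `domain:bus:device.function` form. A hedged sketch (the helper name is hypothetical, not from the PR):

```cpp
#include <cstdio>
#include <string>

// Hypothetical domain-aware variant of the formatting: CUDA's documented
// PCI bus ID string accepts "DDDD:BB:DD.F", so a non-zero domain (e.g. from
// a multi-segment enterprise system) slots in directly.
std::string FormatPciBusIdWithDomain(unsigned domain, unsigned bus,
                                     unsigned device, unsigned function) {
  char buf[32];
  std::snprintf(buf, sizeof(buf), "%04x:%02x:%02x.%x",
                domain, bus, device, function);
  return buf;
}
```

For domain 0 this produces exactly the same strings as the current hardcoded `0000:` prefix, so the change would be backward compatible.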
```cpp
}  // namespace test
}  // namespace onnxruntime

#endif  // defined(ORT_UNIT_TEST_HAS_CUDA_PLUGIN_EP)
```
Nitpick: Missing trailing newline at end of file.
Description
This PR mainly has two changes:
- Add the PCI bus ID as device metadata during device discovery on Windows.
- Use `cudaDeviceGetByPCIBusId` with the provided PCI bus ID to get the CUDA device ordinal as the device id (it will later be used for OrtMemoryInfo creation). Even though ORT hardware enumeration order diverges from CUDA-visible ordinal order, the plugin CUDA EP makes sure the allocator/memory-info matches the device.

Motivation and Context
The current code in the plugin CUDA EP, `int current_device_id = cuda_device_index++;`, assumes that the Nth NVIDIA GPU in ORT's hardware enumeration order corresponds to CUDA ordinal N. This is fragile because: