Add PCI bus fallback for Linux GPU device discovery in containerized environments by Copilot · Pull Request #27591 · microsoft/onnxruntime

Copilot · 2026-03-09T16:14:00Z

Description

GPU device discovery on Linux relies exclusively on /sys/class/drm/cardN entries (DRM subsystem). In AKS/Kubernetes containers, nvidia-drm is typically not loaded—only the base NVIDIA driver is needed for CUDA compute. No DRM entries means no OrtHardwareDevice with OrtHardwareDeviceType_GPU is created, so GetEpDevices never matches the CUDA EP.

Adds a fallback path in GetGpuDevices() that scans /sys/bus/pci/devices/ when DRM yields zero GPUs:

- DetectGpuPciPaths(): enumerates PCI devices, filters by class code 0x0300 (VGA) and 0x0302 (3D controller, used by NVIDIA datacenter GPUs) per the PCI Code and ID Assignment Specification (base class 03h). Accepts an injectable sysfs root path for testability.
- GetGpuDeviceFromPci(): reads vendor/device files directly from the PCI device sysfs path and populates OrtHardwareDevice with pci_bus_id and discrete GPU metadata. Note: card_idx is intentionally omitted from PCI-discovered devices since directory_iterator traversal order is unspecified and cannot be made consistent with DRM's cardN ordering.
- GetGpuDevices(): tries DRM first; if empty, falls back to the PCI scan.
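The class-code filter described above can be sketched as follows. This is a minimal illustration, not the PR's implementation: `ParsePciClassCode` and `IsGpuPciClass` are hypothetical names, and the sketch assumes the sysfs `class` file holds a 24-bit hex value (base class, subclass, programming interface) such as `0x030200`.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical helper: parse the text of a sysfs PCI `class` file, e.g. "0x030200\n".
// Returns the 24-bit class code, or std::nullopt if the text is not a valid hex value.
std::optional<uint32_t> ParsePciClassCode(const std::string& text) {
  try {
    std::size_t consumed = 0;
    const unsigned long value = std::stoul(text, &consumed, 16);  // accepts an optional "0x" prefix
    if (consumed == 0 || value > 0xFFFFFFul) return std::nullopt;
    return static_cast<uint32_t>(value);
  } catch (const std::exception&) {
    return std::nullopt;  // empty, non-numeric, or out-of-range input
  }
}

// Hypothetical predicate: true for base class 0x03 with subclass 0x00 (VGA compatible
// controller) or 0x02 (3D controller). The low byte (programming interface) is ignored.
bool IsGpuPciClass(uint32_t class_code) {
  const uint32_t base_and_subclass = class_code >> 8;
  return base_and_subclass == 0x0300 || base_and_subclass == 0x0302;
}
```

Comparing only the upper 16 bits keeps the check independent of the programming-interface byte, which is why a file value of `0x030200` matches class code 0x0302.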

The PCI detection functions are exposed via a new onnxruntime::pci_device_discovery namespace (declared in core/platform/linux/pci_device_discovery.h) so they can be tested hermetically with fake sysfs directories.

The fallback only activates when DRM finds nothing, so no behavioral change on systems where DRM works.
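The "DRM first, PCI only when DRM is empty" control flow can be illustrated with a small sketch. The `GpuInfo` struct, the `Discoverer` alias, and `DiscoverGpus` are hypothetical stand-ins for illustration; the real code works with `OrtHardwareDevice` and status-returning functions.

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical, simplified device record used only for this sketch.
struct GpuInfo {
  std::string source;   // which discovery path produced the entry
  unsigned vendor_id;   // PCI vendor ID
};

using Discoverer = std::function<std::vector<GpuInfo>()>;

// Try the primary (DRM) discovery first. The fallback (PCI) is invoked only when the
// primary path yields zero devices, so systems where DRM works see no change.
std::vector<GpuInfo> DiscoverGpus(const Discoverer& from_drm, const Discoverer& from_pci) {
  std::vector<GpuInfo> gpus = from_drm();
  if (gpus.empty()) {
    gpus = from_pci();
  }
  return gpus;
}
```

Passing the two discovery paths as callables is a choice made for this sketch so that the ordering can be exercised in isolation; it is not a claim about how the PR structures `GetGpuDevices()`.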

Also adds:

- A cross-platform GpuDevicesHaveValidProperties test that validates GPU device type and vendor ID when GPUs are present. The test intentionally does not assert on device_id since some platforms (e.g., Apple Silicon) do not populate it.
- Comprehensive hermetic Linux unit tests (test/platform/linux/pci_device_discovery_test.cc) that create fake sysfs directory structures to exercise the PCI fallback path, covering VGA/3D controller detection, non-GPU filtering, empty/missing paths, multiple GPUs, vendor/device ID reading, and NVIDIA discrete metadata. Tests use the ASSERT_STATUS_OK() macro from test/util/include/asserts.h and use CreateFakePciDevice to set up complete fake PCI device directories for both DetectGpuPciPaths and GetGpuDeviceFromPci tests.
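The idea of a hermetic test against a fake sysfs tree can be sketched as below. The name `CreateFakePciDevice` comes from the PR description, but the signature shown here is an assumption, and `FindGpuBusIds` is a hypothetical scanner written only for this illustration.

```cpp
#include <algorithm>
#include <cstdlib>
#include <filesystem>
#include <fstream>
#include <string>
#include <system_error>
#include <vector>

namespace fs = std::filesystem;

// Assumed signature: create <root>/<bus_id>/{class,vendor,device} with the given
// contents, mimicking the layout of /sys/bus/pci/devices.
void CreateFakePciDevice(const fs::path& root, const std::string& bus_id,
                         const std::string& class_code, const std::string& vendor,
                         const std::string& device) {
  const fs::path dir = root / bus_id;
  fs::create_directories(dir);
  std::ofstream(dir / "class") << class_code << "\n";
  std::ofstream(dir / "vendor") << vendor << "\n";
  std::ofstream(dir / "device") << device << "\n";
}

// Hypothetical scanner: return the sorted bus IDs of entries under `root` whose
// class file reports a VGA (0x0300) or 3D controller (0x0302) class code.
std::vector<std::string> FindGpuBusIds(const fs::path& root) {
  std::vector<std::string> bus_ids;
  std::error_code ec;
  fs::directory_iterator it{root, ec};
  if (ec) return bus_ids;  // missing or unreadable root: report no devices
  for (const auto& entry : it) {
    std::ifstream class_file(entry.path() / "class");
    std::string text;
    if (!(class_file >> text)) continue;  // no class file: skip the entry
    const unsigned long code = std::strtoul(text.c_str(), nullptr, 16);
    const unsigned long base_and_subclass = code >> 8;
    if (base_and_subclass == 0x0300 || base_and_subclass == 0x0302) {
      bus_ids.push_back(entry.path().filename().string());
    }
  }
  std::sort(bus_ids.begin(), bus_ids.end());  // traversal order is unspecified
  return bus_ids;
}
```

A test would create a temporary root, add a mix of GPU and non-GPU fake devices, and pass that root to the scanner in place of the real sysfs path, which is what an injectable root makes possible.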

Motivation and Context

CUDA EP registration fails on AKS (Azure Kubernetes Service) because the NVIDIA device plugin exposes GPUs via /dev/nvidia* and the NVIDIA driver, but does not load nvidia-drm. The existing /sys/class/drm-only detection path returns no GPU devices, blocking GetEpDevices from returning the CUDA EP. The same setup works on bare-metal Linux where DRM is loaded.


In AKS/Kubernetes environments, the nvidia-drm kernel module may not be loaded, resulting in no /sys/class/drm/cardN entries for NVIDIA GPUs. This adds a fallback that scans /sys/bus/pci/devices/ for GPU-class PCI devices (VGA compatible controllers and 3D controllers) when DRM-based detection finds no GPUs.
Co-authored-by: baijumeswani <12852605+baijumeswani@users.noreply.github.com>

…ment for card_idx
Co-authored-by: baijumeswani <12852605+baijumeswani@users.noreply.github.com>

Copilot

Pull request overview

Adds a PCI-bus sysfs fallback to Linux GPU device discovery so GPU devices can still be surfaced (and EPs matched) in containerized environments where /sys/class/drm/cardN is absent (e.g., when nvidia-drm isn’t loaded).

Changes:

- Added PCI scanning of /sys/bus/pci/devices (filtering display controllers by class/subclass) and creation of OrtHardwareDevice records from vendor/device sysfs files.
- Updated GetGpuDevices() to attempt DRM-based discovery first and fall back to PCI scanning only when DRM yields zero GPUs.
- Added a new device discovery test that validates basic GPU properties when GPUs are present.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File	Description
onnxruntime/core/platform/linux/device_discovery.cc	Implements PCI-bus fallback GPU discovery and integrates it into Linux `GetGpuDevices()` when DRM discovery returns empty.
onnxruntime/test/platform/device_discovery_test.cc	Adds a GPU discovery smoke/validation test to ensure GPU enumeration doesn’t crash and returns sane IDs.


onnxruntime/test/platform/device_discovery_test.cc

onnxruntime/core/platform/linux/device_discovery.cc

…bility

The device_id field is not populated on Apple Silicon (defaults to 0), so asserting device_id != 0 would cause test failures on that platform.

Co-authored-by: baijumeswani <12852605+baijumeswani@users.noreply.github.com>

…paths and add hermetic tests

- Move PCI detection functions (DetectGpuPciPaths, GetGpuDeviceFromPci) from anonymous namespace to onnxruntime::pci_device_discovery namespace
- Add device_discovery_linux.h header exposing PCI detection API for testing
- Make DetectGpuPciPaths accept sysfs root path parameter for testability
- Add comprehensive hermetic unit tests using fake sysfs directory structure:
  - VGA controller detection (class 0x0300)
  - 3D controller detection (class 0x0302)
  - Filtering out non-GPU PCI devices
  - Nonexistent/empty directory handling
  - Multiple GPU detection
  - Missing class file handling
  - Vendor/device ID reading and metadata population
  - NVIDIA vs non-NVIDIA discrete GPU metadata

Co-authored-by: baijumeswani <12852605+baijumeswani@users.noreply.github.com>

Co-authored-by: baijumeswani <12852605+baijumeswani@users.noreply.github.com>

Copilot

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated no new comments.


edgchen1

@copilot please address the comments in this review

onnxruntime/core/platform/linux/device_discovery.cc

onnxruntime/test/platform/linux/pci_device_discovery_test.cc

…m PCI path, use ASSERT_STATUS_OK, remove unnecessary test setup

- Add reference link to PCI Code and ID Assignment Specification for class codes
- Remove card_idx metadata from PCI fallback since directory_iterator order is unspecified and cannot match DRM's cardN ordering
- Replace ASSERT_TRUE(status.IsOK()) with ASSERT_STATUS_OK() macro
- In GetGpuDeviceFromPci tests, only create vendor/device files (not class) since those are the only files the function reads

Co-authored-by: edgchen1 <18449977+edgchen1@users.noreply.github.com>

Copilot · 2026-03-10T19:36:49Z

@copilot please address the comments in this review

All review comments addressed in commit 1dacf9f.

onnxruntime/core/platform/linux/pci_device_discovery.h

onnxruntime/core/platform/linux/device_discovery.cc

Co-authored-by: edgchen1 <18449977+edgchen1@users.noreply.github.com>

onnxruntime/test/platform/device_discovery_test.cc

tianleiwu

Overview

The PR implements a fallback to PCI bus scanning for GPU device discovery on Linux systems where DRM sysfs entries (/sys/class/drm) are unavailable. The implementation is clean, logically sound, and follows the existing repository patterns.

Summary

**Approve with Suggestion**

This PR represents a solid, well-tested improvement for GPU detection in containerized Linux environments (like AKS). It strongly adheres to memory safety guidelines, utilizes idiomatic C++ (no manual resource management), avoids ABI issues, and introduces comprehensive unit tests.

The implementation is robust for typical use cases. However, I have one primary suggestion regarding resilience: the PCI fallback loop currently fails the entire discovery phase if parsing metadata for a single device fails (e.g., temporary sysfs contention). I recommend adapting the loop to log a warning and continue, rather than fully aborting. Regardless, the core logic is sound and the PR is approved pending this minor consideration.

Review Focus Areas Alignment

Memory Management & Alignment

Status: Passes.
Notes: The implementation relies exclusively on modern C++ STL containers (std::vector, std::string) and classes (std::filesystem::path). There are no raw memory allocations (malloc/new), conforming perfectly to the memory safety guidelines.

C-API & ABI Stability

Status: Passes.
Notes: The changes are contained entirely within internal C++ platform code (device_discovery.cc, pci_device_discovery.h). No public C-API structs or ABI boundaries are affected.

Global State & Thread Safety

Status: Passes.
Notes: The introduced namespace functions and utility routines (ReadValueFromFile, DetectGpuPciPaths, GetGpuDeviceFromPci) do not mutate local static or global states. Constant expressions and local isolated variables are correctly used.

Performance in the Hot Path

Status: Passes.
Notes: The filesystem scanning strictly occurs during environment discovery and session initialization, keeping the computation hot path completely free of heap allocations or file string manipulations.

Additional Suggestions

1. Robustness of GPU Fallback Discovery (Graceful Degradation)

Currently, in both the existing DRM discovery loop and the new PCI fallback loop, when populating gpu_devices:

for (const auto& gpu_pci_path_info : gpu_pci_path_infos) {
  OrtHardwareDevice gpu_device{};
  ORT_RETURN_IF_ERROR(pci_device_discovery::GetGpuDeviceFromPci(gpu_pci_path_info, gpu_device));
  gpu_devices.emplace_back(std::move(gpu_device));
}

If GetGpuDeviceFromPci fails for one of the identified devices (for instance if a vendor or device sysfs file is momentarily inaccessible due to unbinding, or format changes), ORT_RETURN_IF_ERROR propagates the error immediately, aborting the entire discovery process. This leads to DiscoverDevicesForPlatform failing to register any GPUs.
Suggestion: Consider logging a warning and continuing to the next device when parsing an individual device fails, rather than failing the entire fallback scan. This would make device discovery more resilient to edge cases in which a single misbehaving PCIe device would otherwise prevent all valid devices from being registered.
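A minimal sketch of the log-and-continue pattern suggested above, using simplified stand-ins: `Device`, `TryReadDevice`, and `CollectDevices` are hypothetical names, and `std::optional` replaces the status type used in the real code. In the actual loop, the equivalent change would be checking the status returned by `GetGpuDeviceFromPci` instead of wrapping the call in `ORT_RETURN_IF_ERROR`.

```cpp
#include <iostream>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified device record for this sketch.
struct Device {
  std::string pci_bus_id;
};

// Hypothetical stand-in for per-device parsing. Here an empty path simulates a
// device whose sysfs files could not be read.
std::optional<Device> TryReadDevice(const std::string& path) {
  if (path.empty()) return std::nullopt;
  return Device{path};
}

// Collect every device that parses successfully. A failure on one entry is logged
// and skipped, so it cannot hide the remaining valid devices.
std::vector<Device> CollectDevices(const std::vector<std::string>& paths) {
  std::vector<Device> devices;
  for (const auto& path : paths) {
    std::optional<Device> device = TryReadDevice(path);
    if (!device) {
      std::cerr << "warning: skipping PCI device '" << path << "': failed to read metadata\n";
      continue;
    }
    devices.push_back(std::move(*device));
  }
  return devices;
}
```

The trade-off is that a partial result is returned silently apart from the warning, so callers that need to distinguish "no GPUs" from "some GPUs skipped" would have to rely on the log output.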

2. File Iteration Exception Safety

When iterating over the directory_iterator with a range-based for loop:

for (const auto& dir_item : dir_iterator) {

While you supply error_code during initialization (fs::directory_iterator{sysfs_pci_devices_path, error_code}), advancing the iterator implicitly calls iterator::operator++ without an error code, which can throw std::filesystem::filesystem_error if permissions or filesystem I/O errors are encountered mid-iteration.
This matches the existing pre-PR code pattern (DetectGpuSysfsPaths), but if strict exception safety is required in sysfs scans, manual iteration using .increment(ec) is safer.

…copilot/analyze-cuda-device-detection

baijumeswani · 2026-03-12T18:39:36Z

@edgchen1 @tianleiwu Thank you for your review and comments. :)

…environments (#27591)

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: baijumeswani <12852605+baijumeswani@users.noreply.github.com>
Co-authored-by: edgchen1 <18449977+edgchen1@users.noreply.github.com>
Co-authored-by: Baiju Meswani <bmeswani@microsoft.com>

This cherry-picks the following commits for the release:

| Commit ID | PR Number | Commit Title |
|-----------|-----------|-------------|
| eb23be8 | #27354 | Update python_requires |
| d626b56 | #27479 | [QNN EP] Enable offline x64 compilation with memhandle IO type |
| 60ce0e6 | #27607 | Use `_tpause` instead of `__builtin_ia32_tpause` |
| 69feb84 | #27591 | Add PCI bus fallback for Linux GPU device discovery in containerized environments |
| de92668 | #27650 | Revert "[QNN EP] Fix error messages being logged as VERBOSE instead o… |
| 0f66526 | #27644 | [Plugin EP] Check for nullptr before dereferencing |
| 929f73e | #27666 | Plugin EP: Fix bug that incorrectly assigned duplicate MetDef IDs to fused nodes in different GraphViews |

---------

Co-authored-by: XXXXRT666 <157766680+XXXXRT666@users.noreply.github.com>
Co-authored-by: derdeljan-msft <derdeljan@microsoft.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Shogo Yamazaki <f9ifphmiz7i8akhowc8l5t1x9qp0lfu4@mocknen.net>
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: baijumeswani <12852605+baijumeswani@users.noreply.github.com>
Co-authored-by: edgchen1 <18449977+edgchen1@users.noreply.github.com>
Co-authored-by: Baiju Meswani <bmeswani@microsoft.com>
Co-authored-by: Artur Wojcik <artur.wojcik@amd.com>
Co-authored-by: Adrian Lizarraga <adlizarraga@microsoft.com>

@tianleiwu

On AWS EC2 vGPU instances (e.g. g5.xlarge with A10G), /sys/class/drm/card0 exists (simple-framebuffer for console VGA) but lacks device/vendor. GetGpuDeviceFromSysfs fails, and ORT_RETURN_IF_ERROR propagates the error immediately, preventing the PCI fallback at line 289 from ever running.

The PCI path has the correct data:

/sys/bus/pci/devices/0000:00:1e.0/vendor = 0x10de (NVIDIA)
/sys/bus/pci/devices/0000:00:1e.0/class = 0x030200 (3D controller)

Change the DRM loop to log a warning and continue instead of returning error, as suggested by @tianleiwu during the microsoft#27591 review. This allows the loop to skip non-GPU DRM entries (like simple-framebuffer) and fall through to the PCI fallback when no valid GPU is found via DRM.

Fixes microsoft#27806

Initial plan

ca28284

Copilot AI assigned Copilot and baijumeswani Mar 9, 2026

Copilot started work on behalf of baijumeswani March 9, 2026 16:14

Copilot AI and others added 2 commits March 9, 2026 16:20

Address review feedback: use VERBOSE log level and add clarifying com…

785535e

…ment for card_idx
Co-authored-by: baijumeswani <12852605+baijumeswani@users.noreply.github.com>

Copilot AI changed the title ~~[WIP] Analyze potential issues with cuda execution provider in AKS~~ Add PCI bus fallback for Linux GPU device discovery in containerized environments Mar 9, 2026

Copilot finished work on behalf of baijumeswani March 9, 2026 16:23

baijumeswani requested a review from Copilot March 9, 2026 17:20

Copilot started reviewing on behalf of baijumeswani March 9, 2026 17:21

Copilot AI reviewed Mar 9, 2026

onnxruntime/test/platform/device_discovery_test.cc

onnxruntime/core/platform/linux/device_discovery.cc

Copilot started work on behalf of baijumeswani March 9, 2026 17:26

Copilot finished work on behalf of baijumeswani March 9, 2026 17:30

Copilot started work on behalf of baijumeswani March 9, 2026 17:31

Copilot AI and others added 2 commits March 9, 2026 17:37

Extract hardcoded sysfs PCI path to a named constant

09ea516

Co-authored-by: baijumeswani <12852605+baijumeswani@users.noreply.github.com>

Copilot finished work on behalf of baijumeswani March 9, 2026 17:39

baijumeswani marked this pull request as ready for review March 10, 2026 18:21

baijumeswani requested a review from Copilot March 10, 2026 18:21

Copilot started reviewing on behalf of baijumeswani March 10, 2026 18:22

Copilot AI reviewed Mar 10, 2026

baijumeswani requested a review from edgchen1 March 10, 2026 18:29

edgchen1 reviewed Mar 10, 2026

Copilot started work on behalf of edgchen1 March 10, 2026 19:31

Copilot finished work on behalf of edgchen1 March 10, 2026 19:37

edgchen1 reviewed Mar 10, 2026

onnxruntime/core/platform/linux/pci_device_discovery.h

Copilot started work on behalf of edgchen1 March 10, 2026 19:41

Copilot finished work on behalf of edgchen1 March 10, 2026 19:44

edgchen1 reviewed Mar 10, 2026

onnxruntime/core/platform/linux/device_discovery.cc (outdated)

Fix reference link.

6ddce68

Copilot started work on behalf of edgchen1 March 10, 2026 21:12

Restore CreateFakePciDevice calls in GetGpuDeviceFromPci tests

437da85

Co-authored-by: edgchen1 <18449977+edgchen1@users.noreply.github.com>

edgchen1 reviewed Mar 10, 2026

onnxruntime/test/platform/device_discovery_test.cc (outdated)

Apply suggestion from @edgchen1

fb5b0a0

Copilot finished work on behalf of edgchen1 March 10, 2026 21:16

edgchen1 approved these changes Mar 10, 2026

tianleiwu approved these changes Mar 10, 2026

Merge branch 'main' of https://github.com/microsoft/onnxruntime into …

5d7db8b

…copilot/analyze-cuda-device-detection

baijumeswani enabled auto-merge (squash) March 12, 2026 06:27

baijumeswani merged commit 69feb84 into main Mar 12, 2026
94 of 95 checks passed

baijumeswani deleted the copilot/analyze-cuda-device-detection branch March 12, 2026 17:55

tianleiwu added the release:1.24.4 label Mar 12, 2026

tianleiwu mentioned this pull request Mar 16, 2026

ORT 1.24.4 release cherry pick round 1 #27682

Merged

f-dy mentioned this pull request Mar 23, 2026

PCI fallback unreachable on AWS EC2 vGPU — DRM loop error propagation bypasses fallback (#27591) #27806

Open

f-dy mentioned this pull request Mar 23, 2026

Fix DRM loop error propagation bypassing PCI fallback on vGPU instances #27810

Open

dependabot bot mentioned this pull request Apr 6, 2026

deps(nuget): Bump Microsoft.AspNetCore.Authentication.JwtBearer and 10 others Ellerbach/azure-ai-search-simulator#92

Open

baijumeswani merged 12 commits into main from copilot/analyze-cuda-device-detection
