
Add PCI bus fallback for Linux GPU device discovery in containerized environments#27591

Merged
baijumeswani merged 12 commits into main from
copilot/analyze-cuda-device-detection
Mar 12, 2026


Copilot AI commented Mar 9, 2026

Description

GPU device discovery on Linux relies exclusively on /sys/class/drm/cardN entries (DRM subsystem). In AKS/Kubernetes containers, nvidia-drm is typically not loaded—only the base NVIDIA driver is needed for CUDA compute. No DRM entries means no OrtHardwareDevice with OrtHardwareDeviceType_GPU is created, so GetEpDevices never matches the CUDA EP.

Adds a fallback path in GetGpuDevices() that scans /sys/bus/pci/devices/ when DRM yields zero GPUs:

  • DetectGpuPciPaths() — enumerates PCI devices, filters by class code 0x0300 (VGA) and 0x0302 (3D controller, used by NVIDIA datacenter GPUs) per the PCI Code and ID Assignment Specification (base class 03h). Accepts an injectable sysfs root path for testability.
  • GetGpuDeviceFromPci() — reads vendor/device files directly from the PCI device sysfs path and populates OrtHardwareDevice with pci_bus_id and discrete GPU metadata. Note: card_idx is intentionally omitted from PCI-discovered devices since directory_iterator traversal order is unspecified and cannot be made consistent with DRM's cardN ordering.
  • GetGpuDevices() — tries DRM first; if empty, falls back to PCI scan
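
The class-code filter described above can be sketched as follows. This is a minimal illustration, not the PR's code: `ParseSysfsHex`, `IsGpuPciClass`, and `IsGpuPciDevice` are hypothetical names, and the sketch assumes the sysfs `class` file holds a 24-bit hex value whose low byte is the programming interface (for example `0x030200` for a 3D controller), so the base class and subclass are obtained by shifting right by 8 bits.

```cpp
#include <cassert>
#include <cstdint>
#include <exception>
#include <filesystem>
#include <fstream>
#include <optional>
#include <string>

namespace fs = std::filesystem;

// Hypothetical helper: parse a sysfs hex string such as "0x030200".
// Returns std::nullopt when the text is not a valid hex number.
std::optional<uint32_t> ParseSysfsHex(const std::string& text) {
  try {
    return static_cast<uint32_t>(std::stoul(text, nullptr, 16));
  } catch (const std::exception&) {
    return std::nullopt;
  }
}

// Hypothetical helper: true for VGA controllers (0x0300) and 3D controllers
// (0x0302). The low byte (programming interface) is discarded first.
bool IsGpuPciClass(uint32_t class_value) {
  const uint32_t base_and_subclass = (class_value >> 8) & 0xFFFF;
  return base_and_subclass == 0x0300 || base_and_subclass == 0x0302;
}

// Hypothetical helper: reads <pci_device_dir>/class and applies the filter.
// A missing or unreadable class file is treated as "not a GPU".
bool IsGpuPciDevice(const fs::path& pci_device_dir) {
  std::ifstream class_file{pci_device_dir / "class"};
  std::string text;
  if (!(class_file >> text)) return false;
  const auto value = ParseSysfsHex(text);
  return value.has_value() && IsGpuPciClass(*value);
}
```

Keeping the parse and the class check as pure functions means they can be exercised without touching a real sysfs tree.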

The PCI detection functions are exposed via a new onnxruntime::pci_device_discovery namespace (declared in core/platform/linux/pci_device_discovery.h) so they can be tested hermetically with fake sysfs directories.

The fallback only activates when DRM finds nothing, so no behavioral change on systems where DRM works.
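
The DRM-first, PCI-second ordering can be pictured with a small sketch. The `GpuInfo` struct, the `GpuScan` alias, and `DiscoverGpus` are hypothetical stand-ins for illustration; the real `GetGpuDevices()` works with `OrtHardwareDevice` and status codes.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical record: where the GPU was found and its sysfs path.
struct GpuInfo {
  std::string source;
  std::string sysfs_path;
};

using GpuScan = std::function<std::vector<GpuInfo>()>;

// Runs the DRM scan first; the PCI scan is consulted only if DRM found nothing.
std::vector<GpuInfo> DiscoverGpus(const GpuScan& scan_drm, const GpuScan& scan_pci) {
  std::vector<GpuInfo> gpus = scan_drm();
  if (gpus.empty()) {
    gpus = scan_pci();
  }
  return gpus;
}
```

Because the PCI scan never runs when DRM returns at least one device, results on hosts with working DRM stay identical to the pre-PR behavior.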

Also adds:

  • A cross-platform GpuDevicesHaveValidProperties test that validates GPU device type and vendor ID when GPUs are present. The test intentionally does not assert on device_id since some platforms (e.g., Apple Silicon) do not populate it.
  • Comprehensive hermetic Linux unit tests (test/platform/linux/pci_device_discovery_test.cc) that create fake sysfs directory structures to exercise the PCI fallback path, covering VGA/3D controller detection, non-GPU filtering, empty/missing paths, multiple GPUs, vendor/device ID reading, and NVIDIA discrete metadata. Tests use the ASSERT_STATUS_OK() macro from test/util/include/asserts.h and use CreateFakePciDevice to set up complete fake PCI device directories for both DetectGpuPciPaths and GetGpuDeviceFromPci tests.
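
A hermetic fixture of the kind described above might look like the sketch below. `MakeFakePciDevice` and `ReadFirstToken` are hypothetical names and are not the PR's `CreateFakePciDevice`; the sketch only assumes that each PCI device directory contains `class`, `vendor`, and `device` files holding hex strings.

```cpp
#include <cassert>
#include <filesystem>
#include <fstream>
#include <string>

namespace fs = std::filesystem;

// Hypothetical fixture: creates <root>/<bus_id>/{class,vendor,device},
// mimicking the layout under /sys/bus/pci/devices/.
fs::path MakeFakePciDevice(const fs::path& root, const std::string& bus_id,
                           const std::string& class_hex,
                           const std::string& vendor_hex,
                           const std::string& device_hex) {
  const fs::path dir = root / bus_id;
  fs::create_directories(dir);
  auto write = [&dir](const char* name, const std::string& value) {
    std::ofstream out{dir / name};
    out << value << '\n';
  };
  write("class", class_hex);
  write("vendor", vendor_hex);
  write("device", device_hex);
  return dir;
}

// Hypothetical helper: reads the first whitespace-delimited token of a file.
std::string ReadFirstToken(const fs::path& file) {
  std::ifstream in{file};
  std::string token;
  in >> token;
  return token;
}
```

A test would build such a tree under a temporary directory, pass the root to the detection function in place of the real sysfs path, and remove the tree afterwards.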

Motivation and Context

CUDA EP registration fails on AKS (Azure Kubernetes Service) because the NVIDIA device plugin exposes GPUs via /dev/nvidia* and the NVIDIA driver, but does not load nvidia-drm. The existing /sys/class/drm-only detection path returns no GPU devices, blocking GetEpDevices from returning the CUDA EP. The same setup works on bare-metal Linux where DRM is loaded.



Copilot AI and others added 2 commits March 9, 2026 16:20
In AKS/Kubernetes environments, the nvidia-drm kernel module may not be
loaded, resulting in no /sys/class/drm/cardN entries for NVIDIA GPUs.
This adds a fallback that scans /sys/bus/pci/devices/ for GPU-class PCI
devices (VGA compatible controllers and 3D controllers) when DRM-based
detection finds no GPUs.

Co-authored-by: baijumeswani <12852605+baijumeswani@users.noreply.github.com>
…ment for card_idx

Co-authored-by: baijumeswani <12852605+baijumeswani@users.noreply.github.com>
Copilot AI changed the title [WIP] Analyze potential issues with cuda execution provider in AKS Add PCI bus fallback for Linux GPU device discovery in containerized environments Mar 9, 2026
@baijumeswani baijumeswani requested a review from Copilot March 9, 2026 17:20

Copilot AI left a comment


Pull request overview

Adds a PCI-bus sysfs fallback to Linux GPU device discovery so GPU devices can still be surfaced (and EPs matched) in containerized environments where /sys/class/drm/cardN is absent (e.g., when nvidia-drm isn’t loaded).

Changes:

  • Added PCI scanning of /sys/bus/pci/devices (filtering display controllers by class/subclass) and creation of OrtHardwareDevice records from vendor/device sysfs files.
  • Updated GetGpuDevices() to attempt DRM-based discovery first and fall back to PCI scanning only when DRM yields zero GPUs.
  • Added a new device discovery test that validates basic GPU properties when GPUs are present.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
onnxruntime/core/platform/linux/device_discovery.cc Implements PCI-bus fallback GPU discovery and integrates it into Linux GetGpuDevices() when DRM discovery returns empty.
onnxruntime/test/platform/device_discovery_test.cc Adds a GPU discovery smoke/validation test to ensure GPU enumeration doesn’t crash and returns sane IDs.


…bility

The device_id field is not populated on Apple Silicon (defaults to 0),
so asserting device_id != 0 would cause test failures on that platform.

Co-authored-by: baijumeswani <12852605+baijumeswani@users.noreply.github.com>
Copilot AI and others added 2 commits March 9, 2026 17:37
…paths and add hermetic tests

- Move PCI detection functions (DetectGpuPciPaths, GetGpuDeviceFromPci) from
  anonymous namespace to onnxruntime::pci_device_discovery namespace
- Add device_discovery_linux.h header exposing PCI detection API for testing
- Make DetectGpuPciPaths accept sysfs root path parameter for testability
- Add comprehensive hermetic unit tests using fake sysfs directory structure:
  - VGA controller detection (class 0x0300)
  - 3D controller detection (class 0x0302)
  - Filtering out non-GPU PCI devices
  - Nonexistent/empty directory handling
  - Multiple GPU detection
  - Missing class file handling
  - Vendor/device ID reading and metadata population
  - NVIDIA vs non-NVIDIA discrete GPU metadata

Co-authored-by: baijumeswani <12852605+baijumeswani@users.noreply.github.com>
Co-authored-by: baijumeswani <12852605+baijumeswani@users.noreply.github.com>
@baijumeswani baijumeswani marked this pull request as ready for review March 10, 2026 18:21
@baijumeswani baijumeswani requested a review from Copilot March 10, 2026 18:21

Copilot AI left a comment


Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated no new comments.



@baijumeswani baijumeswani requested a review from edgchen1 March 10, 2026 18:29

@edgchen1 edgchen1 left a comment


@copilot please address the comments in this review

…m PCI path, use ASSERT_STATUS_OK, remove unnecessary test setup

- Add reference link to PCI Code and ID Assignment Specification for class codes
- Remove card_idx metadata from PCI fallback since directory_iterator order
  is unspecified and cannot match DRM's cardN ordering
- Replace ASSERT_TRUE(status.IsOK()) with ASSERT_STATUS_OK() macro
- In GetGpuDeviceFromPci tests, only create vendor/device files (not class)
  since those are the only files the function reads

Co-authored-by: edgchen1 <18449977+edgchen1@users.noreply.github.com>

Copilot AI commented Mar 10, 2026

@copilot please address the comments in this review

All review comments addressed in commit 1dacf9f.

Co-authored-by: edgchen1 <18449977+edgchen1@users.noreply.github.com>

@tianleiwu tianleiwu left a comment


Overview

The PR implements a fallback to PCI bus scanning for GPU device discovery on Linux systems where DRM sysfs entries (/sys/class/drm) are unavailable. The implementation is clean, logically sound, and follows the existing repository patterns.

Summary

**Approve with Suggestion**

This PR represents a solid, well-tested improvement for GPU detection in containerized Linux environments (like AKS). It strongly adheres to memory safety guidelines, utilizes idiomatic C++ (no manual resource management), avoids ABI issues, and introduces comprehensive unit tests.

The implementation is robust for typical use cases. However, I have one primary suggestion regarding resilience: the PCI fallback loop currently fails the entire discovery phase if parsing metadata for a single device fails (e.g., temporary sysfs contention). I recommend adapting the loop to log a warning and continue, rather than fully aborting. Regardless, the core logic is sound and the PR is approved pending this minor consideration.

Review Focus Areas Alignment

Memory Management & Alignment

  • Status: Passes.
  • Notes: The implementation relies exclusively on modern C++ STL containers (std::vector, std::string) and classes (std::filesystem::path). There are no raw memory allocations (malloc/new), conforming perfectly to the memory safety guidelines.

C-API & ABI Stability

  • Status: Passes.
  • Notes: The changes are contained entirely within internal C++ platform code (device_discovery.cc, pci_device_discovery.h). No public C-API structs or ABI boundaries are affected.

Global State & Thread Safety

  • Status: Passes.
  • Notes: The introduced namespace functions and utility routines (ReadValueFromFile, DetectGpuPciPaths, GetGpuDeviceFromPci) do not mutate local static or global states. Constant expressions and local isolated variables are correctly used.

Performance in the Hot Path

  • Status: Passes.
  • Notes: The filesystem scanning strictly occurs during environment discovery and session initialization, keeping the computation hot path completely free of heap allocations or file string manipulations.

Additional Suggestions

1. Robustness of GPU Fallback Discovery (Graceful Degradation)

Currently, in both the existing DRM discovery loop and the new PCI fallback loop, when populating gpu_devices:

for (const auto& gpu_pci_path_info : gpu_pci_path_infos) {
  OrtHardwareDevice gpu_device{};
  ORT_RETURN_IF_ERROR(pci_device_discovery::GetGpuDeviceFromPci(gpu_pci_path_info, gpu_device));
  gpu_devices.emplace_back(std::move(gpu_device));
}

If GetGpuDeviceFromPci fails for one of the identified devices (for instance if a vendor or device sysfs file is momentarily inaccessible due to unbinding, or format changes), ORT_RETURN_IF_ERROR propagates the error immediately, aborting the entire discovery process. This leads to DiscoverDevicesForPlatform failing to register any GPUs.
Suggestion: Consider logging a warning and continuing to the next device when an individual device fails to parse, rather than failing the entire fallback scan. This would make device discovery much more resilient to edge cases in which a single misbehaving PCIe device prevents all valid devices from being registered.
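
One way the suggested log-and-continue loop could look is sketched below. `GpuRecord`, `CollectReadableGpus`, and the `read_device` callback are hypothetical; in the actual code the loop calls `GetGpuDeviceFromPci` and would use the project's logging macros rather than `std::cerr`.

```cpp
#include <cassert>
#include <iostream>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical record standing in for OrtHardwareDevice.
struct GpuRecord {
  std::string pci_path;
  unsigned vendor_id;
};

// read_device(path) returns std::nullopt when a device's metadata cannot be
// read. Such devices are skipped with a warning instead of aborting the scan.
template <typename ReadFn>
std::vector<GpuRecord> CollectReadableGpus(const std::vector<std::string>& pci_paths,
                                           ReadFn read_device) {
  std::vector<GpuRecord> gpus;
  for (const auto& path : pci_paths) {
    std::optional<GpuRecord> record = read_device(path);
    if (!record) {
      std::cerr << "warning: skipping PCI device " << path
                << ": metadata unreadable\n";
      continue;
    }
    gpus.push_back(std::move(*record));
  }
  return gpus;
}
```

With this shape, one unreadable device reduces the result by one entry rather than emptying it.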

2. File Iteration Exception Safety

When iterating over the directory_iterator with a range-based for loop:

for (const auto& dir_item : dir_iterator) {

While you supply error_code during initialization (fs::directory_iterator{sysfs_pci_devices_path, error_code}), advancing the iterator implicitly calls iterator::operator++ without an error code, which can throw std::filesystem::filesystem_error if permissions or filesystem I/O errors are encountered mid-iteration.
This matches the existing pre-PR code pattern (DetectGpuSysfsPaths), but if strict exception safety is required in sysfs scans, manual iteration using .increment(ec) is safer.
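
A minimal sketch of the manual iteration mentioned above, using the `std::error_code` overloads of both the `directory_iterator` constructor and `increment`. `ListEntriesNoThrow` is a hypothetical name chosen for illustration.

```cpp
#include <cassert>
#include <filesystem>
#include <system_error>
#include <vector>

namespace fs = std::filesystem;

// Lists the entries of `dir` without letting the iterator throw
// std::filesystem::filesystem_error. On failure, `ec` is set and the entries
// collected so far are returned.
std::vector<fs::path> ListEntriesNoThrow(const fs::path& dir, std::error_code& ec) {
  std::vector<fs::path> entries;
  fs::directory_iterator it{dir, ec};
  const fs::directory_iterator end{};
  while (!ec && it != end) {
    entries.push_back(it->path());
    it.increment(ec);  // error_code overload; reports failure through ec
  }
  return entries;
}
```

The trade-off is a slightly more verbose loop in exchange for error handling that matches the non-throwing construction.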

@baijumeswani baijumeswani enabled auto-merge (squash) March 12, 2026 06:27
@baijumeswani baijumeswani merged commit 69feb84 into main Mar 12, 2026
94 of 95 checks passed
@baijumeswani baijumeswani deleted the copilot/analyze-cuda-device-detection branch March 12, 2026 17:55
@baijumeswani

@edgchen1 @tianleiwu Thank you for your review and comments. :)

tianleiwu pushed a commit that referenced this pull request Mar 16, 2026
…environments (#27591)

### Description

GPU device discovery on Linux relies exclusively on
`/sys/class/drm/cardN` entries (DRM subsystem). In AKS/Kubernetes
containers, `nvidia-drm` is typically not loaded—only the base NVIDIA
driver is needed for CUDA compute. No DRM entries means no
`OrtHardwareDevice` with `OrtHardwareDeviceType_GPU` is created, so
`GetEpDevices` never matches the CUDA EP.

Adds a fallback path in `GetGpuDevices()` that scans
`/sys/bus/pci/devices/` when DRM yields zero GPUs:

- **`DetectGpuPciPaths()`** — enumerates PCI devices, filters by class
code `0x0300` (VGA) and `0x0302` (3D controller, used by NVIDIA
datacenter GPUs) per the [PCI Code and ID Assignment
Specification](https://pcisig.com/pci-code-and-id-assignment-specification-agreement)
(base class 03h). Accepts an injectable sysfs root path for testability.
- **`GetGpuDeviceFromPci()`** — reads `vendor`/`device` files directly
from the PCI device sysfs path and populates `OrtHardwareDevice` with
`pci_bus_id` and discrete GPU metadata. Note: `card_idx` is
intentionally omitted from PCI-discovered devices since
`directory_iterator` traversal order is unspecified and cannot be made
consistent with DRM's `cardN` ordering.
- **`GetGpuDevices()`** — tries DRM first; if empty, falls back to PCI
scan

The PCI detection functions are exposed via a new
`onnxruntime::pci_device_discovery` namespace (declared in
`core/platform/linux/pci_device_discovery.h`) so they can be tested
hermetically with fake sysfs directories.

The fallback only activates when DRM finds nothing, so no behavioral
change on systems where DRM works.

Also adds:
- A cross-platform `GpuDevicesHaveValidProperties` test that validates
GPU device type and vendor ID when GPUs are present. The test
intentionally does not assert on `device_id` since some platforms (e.g.,
Apple Silicon) do not populate it.
- Comprehensive hermetic Linux unit tests
(`test/platform/linux/pci_device_discovery_test.cc`) that create fake
sysfs directory structures to exercise the PCI fallback path, covering
VGA/3D controller detection, non-GPU filtering, empty/missing paths,
multiple GPUs, vendor/device ID reading, and NVIDIA discrete metadata.
Tests use the `ASSERT_STATUS_OK()` macro from
`test/util/include/asserts.h` and use `CreateFakePciDevice` to set up
complete fake PCI device directories for both `DetectGpuPciPaths` and
`GetGpuDeviceFromPci` tests.

### Motivation and Context

CUDA EP registration fails on AKS (Azure Kubernetes Service) because the
NVIDIA device plugin exposes GPUs via `/dev/nvidia*` and the NVIDIA
driver, but does not load `nvidia-drm`. The existing
`/sys/class/drm`-only detection path returns no GPU devices, blocking
`GetEpDevices` from returning the CUDA EP. The same setup works on
bare-metal Linux where DRM is loaded.


---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: baijumeswani <12852605+baijumeswani@users.noreply.github.com>
Co-authored-by: edgchen1 <18449977+edgchen1@users.noreply.github.com>
Co-authored-by: Baiju Meswani <bmeswani@microsoft.com>
tianleiwu added a commit that referenced this pull request Mar 16, 2026
This cherry-picks the following commits for the release:

| Commit ID | PR Number | Commit Title |
|-----------|-----------|-------------|
| eb23be8 | #27354 | Update python_requires |
| d626b56 | #27479 | [QNN EP] Enable offline x64 compilation with memhandle IO type |
| 60ce0e6 | #27607 | Use `_tpause` instead of `__builtin_ia32_tpause` |
| 69feb84 | #27591 | Add PCI bus fallback for Linux GPU device discovery in containerized environments |
| de92668 | #27650 | Revert "[QNN EP] Fix error messages being logged as VERBOSE instead o… |
| 0f66526 | #27644 | [Plugin EP] Check for nullptr before dereferencing |
| 929f73e | #27666 | Plugin EP: Fix bug that incorrectly assigned duplicate MetDef IDs to fused nodes in different GraphViews |

---------

Co-authored-by: XXXXRT666 <157766680+XXXXRT666@users.noreply.github.com>
Co-authored-by: derdeljan-msft <derdeljan@microsoft.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Shogo Yamazaki <f9ifphmiz7i8akhowc8l5t1x9qp0lfu4@mocknen.net>
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: baijumeswani <12852605+baijumeswani@users.noreply.github.com>
Co-authored-by: edgchen1 <18449977+edgchen1@users.noreply.github.com>
Co-authored-by: Baiju Meswani <bmeswani@microsoft.com>
Co-authored-by: Artur Wojcik <artur.wojcik@amd.com>
Co-authored-by: Adrian Lizarraga <adlizarraga@microsoft.com>
f-dy pushed a commit to f-dy/onnxruntime that referenced this pull request Mar 23, 2026
On AWS EC2 vGPU instances (e.g. g5.xlarge with A10G), /sys/class/drm/card0
exists (simple-framebuffer for console VGA) but lacks device/vendor.
GetGpuDeviceFromSysfs fails, and ORT_RETURN_IF_ERROR propagates the error
immediately, preventing the PCI fallback at line 289 from ever running.

The PCI path has the correct data:
  /sys/bus/pci/devices/0000:00:1e.0/vendor = 0x10de (NVIDIA)
  /sys/bus/pci/devices/0000:00:1e.0/class  = 0x030200 (3D controller)

Change the DRM loop to log a warning and continue instead of returning
error, as suggested by @tianleiwu during the microsoft#27591 review. This allows
the loop to skip non-GPU DRM entries (like simple-framebuffer) and fall
through to the PCI fallback when no valid GPU is found via DRM.

Fixes microsoft#27806