Add PCI bus fallback for Linux GPU device discovery in containerized environments #27591
baijumeswani merged 12 commits into main
In AKS/Kubernetes environments, the nvidia-drm kernel module may not be loaded, resulting in no /sys/class/drm/cardN entries for NVIDIA GPUs. This adds a fallback that scans /sys/bus/pci/devices/ for GPU-class PCI devices (VGA compatible controllers and 3D controllers) when DRM-based detection finds no GPUs.
Co-authored-by: baijumeswani <12852605+baijumeswani@users.noreply.github.com>
…ment for card_idx Co-authored-by: baijumeswani <12852605+baijumeswani@users.noreply.github.com>
Pull request overview
Adds a PCI-bus sysfs fallback to Linux GPU device discovery so GPU devices can still be surfaced (and EPs matched) in containerized environments where /sys/class/drm/cardN is absent (e.g., when nvidia-drm isn’t loaded).
Changes:
- Added PCI scanning of `/sys/bus/pci/devices` (filtering display controllers by class/subclass) and creation of `OrtHardwareDevice` records from `vendor`/`device` sysfs files.
- Updated `GetGpuDevices()` to attempt DRM-based discovery first and fall back to PCI scanning only when DRM yields zero GPUs.
- Added a new device discovery test that validates basic GPU properties when GPUs are present.
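The class/subclass filter described above can be sketched as follows. This is a minimal illustration, not the code from the PR: `ParsePciClass` and `IsGpuPciClass` are hypothetical names, and the sketch assumes the sysfs `class` file holds a 24-bit hex value (base class, subclass, programming interface) such as `0x030200`.

```cpp
#include <cctype>
#include <cstddef>
#include <cstdint>
#include <exception>
#include <optional>
#include <string>

// Hypothetical helper: parses the text of a sysfs PCI "class" file, e.g. "0x030200\n".
// Returns std::nullopt when the text is not a hex number.
std::optional<uint32_t> ParsePciClass(const std::string& text) {
  try {
    std::size_t pos = 0;
    const unsigned long value = std::stoul(text, &pos, 16);  // accepts an optional "0x" prefix
    // Allow only trailing whitespace (sysfs files end with a newline).
    for (; pos < text.size(); ++pos) {
      if (!std::isspace(static_cast<unsigned char>(text[pos]))) return std::nullopt;
    }
    return static_cast<uint32_t>(value);
  } catch (const std::exception&) {
    return std::nullopt;  // empty, non-numeric, or out-of-range input
  }
}

// Hypothetical helper: true for VGA compatible controllers (0x0300) and 3D controllers (0x0302).
bool IsGpuPciClass(uint32_t class_code) {
  const uint32_t base_and_subclass = (class_code >> 8) & 0xFFFF;  // drop the programming interface byte
  return base_and_subclass == 0x0300 || base_and_subclass == 0x0302;
}
```

Dropping the low byte before comparing means the check does not depend on the programming interface value a particular device reports.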
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| onnxruntime/core/platform/linux/device_discovery.cc | Implements PCI-bus fallback GPU discovery and integrates it into Linux GetGpuDevices() when DRM discovery returns empty. |
| onnxruntime/test/platform/device_discovery_test.cc | Adds a GPU discovery smoke/validation test to ensure GPU enumeration doesn’t crash and returns sane IDs. |
…bility
The device_id field is not populated on Apple Silicon (defaults to 0), so asserting device_id != 0 would cause test failures on that platform.
Co-authored-by: baijumeswani <12852605+baijumeswani@users.noreply.github.com>
…paths and add hermetic tests
- Move PCI detection functions (DetectGpuPciPaths, GetGpuDeviceFromPci) from anonymous namespace to onnxruntime::pci_device_discovery namespace
- Add device_discovery_linux.h header exposing PCI detection API for testing
- Make DetectGpuPciPaths accept sysfs root path parameter for testability
- Add comprehensive hermetic unit tests using fake sysfs directory structure:
  - VGA controller detection (class 0x0300)
  - 3D controller detection (class 0x0302)
  - Filtering out non-GPU PCI devices
  - Nonexistent/empty directory handling
  - Multiple GPU detection
  - Missing class file handling
  - Vendor/device ID reading and metadata population
  - NVIDIA vs non-NVIDIA discrete GPU metadata
Co-authored-by: baijumeswani <12852605+baijumeswani@users.noreply.github.com>
Co-authored-by: baijumeswani <12852605+baijumeswani@users.noreply.github.com>
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated no new comments.
…m PCI path, use ASSERT_STATUS_OK, remove unnecessary test setup
- Add reference link to PCI Code and ID Assignment Specification for class codes
- Remove card_idx metadata from PCI fallback since directory_iterator order is unspecified and cannot match DRM's cardN ordering
- Replace ASSERT_TRUE(status.IsOK()) with ASSERT_STATUS_OK() macro
- In GetGpuDeviceFromPci tests, only create vendor/device files (not class) since those are the only files the function reads
Co-authored-by: edgchen1 <18449977+edgchen1@users.noreply.github.com>
Co-authored-by: edgchen1 <18449977+edgchen1@users.noreply.github.com>
Overview
The PR implements a fallback to PCI bus scanning for GPU device discovery on Linux systems where DRM sysfs entries (/sys/class/drm) are unavailable. The implementation is clean, logically sound, and follows the existing repository patterns.
Summary
**Approve with Suggestion**
This PR represents a solid, well-tested improvement for GPU detection in containerized Linux environments (like AKS). It strongly adheres to memory safety guidelines, utilizes idiomatic C++ (no manual resource management), avoids ABI issues, and introduces comprehensive unit tests.
The implementation is robust for typical use cases. However, I have one primary suggestion regarding resilience: the PCI fallback loop currently fails the entire discovery phase if parsing metadata for a single device fails (e.g., temporary sysfs contention). I recommend adapting the loop to log a warning and continue, rather than fully aborting. Regardless, the core logic is sound and the PR is approved pending this minor consideration.
Review Focus Areas Alignment
Memory Management & Alignment
- Status: Passes.
- Notes: The implementation relies exclusively on modern C++ STL containers (`std::vector`, `std::string`) and classes (`std::filesystem::path`). There are no raw memory allocations (`malloc`/`new`), conforming to the memory safety guidelines.
C-API & ABI Stability
- Status: Passes.
- Notes: The changes are contained entirely within internal C++ platform code (`device_discovery.cc`, `pci_device_discovery.h`). No public C-API structs or ABI boundaries are affected.
Global State & Thread Safety
- Status: Passes.
- Notes: The introduced namespace functions and utility routines (`ReadValueFromFile`, `DetectGpuPciPaths`, `GetGpuDeviceFromPci`) do not mutate local static or global state. Constant expressions and isolated local variables are used correctly.
Performance in the Hot Path
- Status: Passes.
- Notes: The filesystem scanning strictly occurs during environment discovery and session initialization, keeping the computation hot path completely free of heap allocations or file string manipulations.
Additional Suggestions
1. Robustness of GPU Fallback Discovery (Graceful Degradation)
Currently, in both the existing DRM discovery loop and the new PCI fallback loop, when populating `gpu_devices`:

    for (const auto& gpu_pci_path_info : gpu_pci_path_infos) {
      OrtHardwareDevice gpu_device{};
      ORT_RETURN_IF_ERROR(pci_device_discovery::GetGpuDeviceFromPci(gpu_pci_path_info, gpu_device));
      gpu_devices.emplace_back(std::move(gpu_device));
    }

If `GetGpuDeviceFromPci` fails for one of the identified devices (for instance, if a vendor or device sysfs file is momentarily inaccessible due to unbinding, or its format changes), `ORT_RETURN_IF_ERROR` propagates the error immediately, aborting the entire discovery process. This leads to `DiscoverDevicesForPlatform` failing to register any GPUs.
Suggestion: Consider logging a warning and continuing to the next device upon an individual device-parse failure, rather than failing the entire fallback scan. This would make device discovery more resilient to edge cases where a single misbehaving PCIe device would otherwise prevent all valid devices from being registered.
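A minimal sketch of the suggested skip-and-continue pattern, using hypothetical stand-in types (`Status`, `GpuDevice`, `ParseDevice`) rather than the ONNX Runtime ones, and `std::cerr` in place of the project's logging macros:

```cpp
#include <iostream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for illustration only; these are not the ONNX Runtime types.
struct Status {
  bool ok;
  std::string message;
};

struct GpuDevice {
  std::string pci_path;
};

// Hypothetical per-device parser: fails for any path containing "unreadable".
Status ParseDevice(const std::string& pci_path, GpuDevice& out) {
  if (pci_path.find("unreadable") != std::string::npos) {
    return {false, "failed to read vendor/device for " + pci_path};
  }
  out.pci_path = pci_path;
  return {true, ""};
}

// Collects every device that parses successfully; a failure on one device
// produces a warning and does not abort the scan.
std::vector<GpuDevice> CollectDevices(const std::vector<std::string>& pci_paths) {
  std::vector<GpuDevice> devices;
  for (const auto& path : pci_paths) {
    GpuDevice device{};
    const Status status = ParseDevice(path, device);
    if (!status.ok) {
      std::cerr << "warning: skipping device: " << status.message << '\n';
      continue;
    }
    devices.push_back(std::move(device));
  }
  return devices;
}
```

With this shape, one bad entry costs only that entry; the trade-off is that a systematic failure shows up as warnings plus an empty result instead of an error status.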
2. File Iteration Exception Safety
When iterating over the `directory_iterator` with a range-based for loop:

    for (const auto& dir_item : dir_iterator) {

While you supply `error_code` during initialization (`fs::directory_iterator{sysfs_pci_devices_path, error_code}`), advancing the iterator implicitly calls `operator++` without an error code, which can throw `std::filesystem::filesystem_error` if permissions or filesystem I/O errors are encountered mid-iteration.
This matches the existing pre-PR code pattern (DetectGpuSysfsPaths), but if strict exception safety is required in sysfs scans, manual iteration using .increment(ec) is safer.
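For reference, manual iteration with `increment(ec)` could look roughly like this. `ListEntriesNoThrow` is a hypothetical helper written for this sketch, not a function from the PR:

```cpp
#include <filesystem>
#include <string>
#include <system_error>
#include <vector>

namespace fs = std::filesystem;

// Hypothetical helper: lists entry names under `dir` without throwing on filesystem errors.
// On failure, `ec` is set and the names collected so far are returned.
std::vector<std::string> ListEntriesNoThrow(const fs::path& dir, std::error_code& ec) {
  std::vector<std::string> names;
  fs::directory_iterator it{dir, ec};  // non-throwing constructor overload
  if (ec) return names;
  const fs::directory_iterator end{};
  while (it != end) {
    names.push_back(it->path().filename().string());
    it.increment(ec);  // non-throwing counterpart of operator++
    if (ec) break;
  }
  return names;
}
```

The caller decides whether a mid-iteration error should discard the partial result or be logged and tolerated.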
…copilot/analyze-cuda-device-detection
@edgchen1 @tianleiwu Thank you for your review and comments. :)
…environments (#27591)

### Description

GPU device discovery on Linux relies exclusively on `/sys/class/drm/cardN` entries (DRM subsystem). In AKS/Kubernetes containers, `nvidia-drm` is typically not loaded; only the base NVIDIA driver is needed for CUDA compute. No DRM entries means no `OrtHardwareDevice` with `OrtHardwareDeviceType_GPU` is created, so `GetEpDevices` never matches the CUDA EP.

Adds a fallback path in `GetGpuDevices()` that scans `/sys/bus/pci/devices/` when DRM yields zero GPUs:

- **`DetectGpuPciPaths()`**: enumerates PCI devices, filters by class code `0x0300` (VGA) and `0x0302` (3D controller, used by NVIDIA datacenter GPUs) per the [PCI Code and ID Assignment Specification](https://pcisig.com/pci-code-and-id-assignment-specification-agreement) (base class 03h). Accepts an injectable sysfs root path for testability.
- **`GetGpuDeviceFromPci()`**: reads `vendor`/`device` files directly from the PCI device sysfs path and populates `OrtHardwareDevice` with `pci_bus_id` and discrete GPU metadata. Note: `card_idx` is intentionally omitted from PCI-discovered devices since `directory_iterator` traversal order is unspecified and cannot be made consistent with DRM's `cardN` ordering.
- **`GetGpuDevices()`**: tries DRM first; if empty, falls back to PCI scan.

The PCI detection functions are exposed via a new `onnxruntime::pci_device_discovery` namespace (declared in `core/platform/linux/pci_device_discovery.h`) so they can be tested hermetically with fake sysfs directories.

The fallback only activates when DRM finds nothing, so there is no behavioral change on systems where DRM works.

Also adds:

- A cross-platform `GpuDevicesHaveValidProperties` test that validates GPU device type and vendor ID when GPUs are present. The test intentionally does not assert on `device_id` since some platforms (e.g., Apple Silicon) do not populate it.
- Comprehensive hermetic Linux unit tests (`test/platform/linux/pci_device_discovery_test.cc`) that create fake sysfs directory structures to exercise the PCI fallback path, covering VGA/3D controller detection, non-GPU filtering, empty/missing paths, multiple GPUs, vendor/device ID reading, and NVIDIA discrete metadata. Tests use the `ASSERT_STATUS_OK()` macro from `test/util/include/asserts.h` and use `CreateFakePciDevice` to set up complete fake PCI device directories for both `DetectGpuPciPaths` and `GetGpuDeviceFromPci` tests.

### Motivation and Context

CUDA EP registration fails on AKS (Azure Kubernetes Service) because the NVIDIA device plugin exposes GPUs via `/dev/nvidia*` and the NVIDIA driver, but does not load `nvidia-drm`. The existing `/sys/class/drm`-only detection path returns no GPU devices, blocking `GetEpDevices` from returning the CUDA EP. The same setup works on bare-metal Linux where DRM is loaded.

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: baijumeswani <12852605+baijumeswani@users.noreply.github.com>
Co-authored-by: edgchen1 <18449977+edgchen1@users.noreply.github.com>
Co-authored-by: Baiju Meswani <bmeswani@microsoft.com>
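The hermetic testing approach described above (an injectable sysfs root populated with fake device directories) can be illustrated with a small sketch. `MakeFakePciDevice` and `FindGpuPciDevices` are hypothetical names chosen for this example; they are not the PR's `CreateFakePciDevice` or `DetectGpuPciPaths`, whose signatures are not shown here.

```cpp
#include <cstdlib>
#include <filesystem>
#include <fstream>
#include <string>
#include <system_error>
#include <vector>

namespace fs = std::filesystem;

// Hypothetical test helper: creates <root>/<bdf>/class containing `class_text`.
void MakeFakePciDevice(const fs::path& root, const std::string& bdf, const std::string& class_text) {
  fs::create_directories(root / bdf);
  std::ofstream class_file(root / bdf / "class");
  class_file << class_text << '\n';
}

// Hypothetical scanner: returns the names of entries under `root` whose class is
// 0x0300 (VGA) or 0x0302 (3D controller). `root` is injectable so tests can point
// it at a fake directory tree instead of /sys/bus/pci/devices.
std::vector<std::string> FindGpuPciDevices(const fs::path& root) {
  std::vector<std::string> found;
  std::error_code ec;
  for (fs::directory_iterator it{root, ec}, end{}; !ec && it != end; it.increment(ec)) {
    std::ifstream class_file(it->path() / "class");
    std::string text;
    if (!(class_file >> text)) continue;  // missing or empty class file
    char* parse_end = nullptr;
    const unsigned long class_code = std::strtoul(text.c_str(), &parse_end, 16);
    if (parse_end == text.c_str()) continue;  // not a hex number
    const unsigned long base_and_subclass = (class_code >> 8) & 0xFFFF;
    if (base_and_subclass == 0x0300 || base_and_subclass == 0x0302) {
      found.push_back(it->path().filename().string());
    }
  }
  return found;
}
```

A test would build a tree under a temporary directory, call the scanner with that root, and compare the result as a set, since directory iteration order is unspecified.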
This cherry-picks the following commits for the release:

| Commit ID | PR Number | Commit Title |
|-----------|-----------|-------------|
| eb23be8 | #27354 | Update python_requires |
| d626b56 | #27479 | [QNN EP] Enable offline x64 compilation with memhandle IO type |
| 60ce0e6 | #27607 | Use `_tpause` instead of `__builtin_ia32_tpause` |
| 69feb84 | #27591 | Add PCI bus fallback for Linux GPU device discovery in containerized environments |
| de92668 | #27650 | Revert "[QNN EP] Fix error messages being logged as VERBOSE instead o… |
| 0f66526 | #27644 | [Plugin EP] Check for nullptr before dereferencing |
| 929f73e | #27666 | Plugin EP: Fix bug that incorrectly assigned duplicate MetDef IDs to fused nodes in different GraphViews |

Co-authored-by: XXXXRT666 <157766680+XXXXRT666@users.noreply.github.com>
Co-authored-by: derdeljan-msft <derdeljan@microsoft.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Shogo Yamazaki <f9ifphmiz7i8akhowc8l5t1x9qp0lfu4@mocknen.net>
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: baijumeswani <12852605+baijumeswani@users.noreply.github.com>
Co-authored-by: edgchen1 <18449977+edgchen1@users.noreply.github.com>
Co-authored-by: Baiju Meswani <bmeswani@microsoft.com>
Co-authored-by: Artur Wojcik <artur.wojcik@amd.com>
Co-authored-by: Adrian Lizarraga <adlizarraga@microsoft.com>
On AWS EC2 vGPU instances (e.g. g5.xlarge with A10G), /sys/class/drm/card0 exists (simple-framebuffer for console VGA) but lacks device/vendor. GetGpuDeviceFromSysfs fails, and ORT_RETURN_IF_ERROR propagates the error immediately, preventing the PCI fallback at line 289 from ever running.

The PCI path has the correct data:

    /sys/bus/pci/devices/0000:00:1e.0/vendor = 0x10de (NVIDIA)
    /sys/bus/pci/devices/0000:00:1e.0/class = 0x030200 (3D controller)

Change the DRM loop to log a warning and continue instead of returning an error, as suggested by @tianleiwu during the microsoft#27591 review. This allows the loop to skip non-GPU DRM entries (like simple-framebuffer) and fall through to the PCI fallback when no valid GPU is found via DRM.

Fixes microsoft#27806
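The control flow this change aims for can be sketched with plain data in place of sysfs reads. Everything here (`SysfsEntry`, `Gpu`, `Collect`, `DiscoverGpus`) is a hypothetical simplification for illustration, not the ONNX Runtime implementation:

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Hypothetical model of a sysfs entry: vendor_id is empty when the entry
// has no readable device/vendor file (e.g. a simple-framebuffer card0).
struct SysfsEntry {
  std::string path;
  std::optional<uint32_t> vendor_id;
};

struct Gpu {
  std::string path;
  uint32_t vendor_id;
};

// Keeps the entries that have a vendor ID and skips the rest instead of failing.
std::vector<Gpu> Collect(const std::vector<SysfsEntry>& entries) {
  std::vector<Gpu> gpus;
  for (const auto& entry : entries) {
    if (!entry.vendor_id) continue;  // the real code would log a warning here
    gpus.push_back({entry.path, *entry.vendor_id});
  }
  return gpus;
}

// DRM first; PCI is consulted only when DRM yields no usable GPU.
std::vector<Gpu> DiscoverGpus(const std::vector<SysfsEntry>& drm_entries,
                              const std::vector<SysfsEntry>& pci_entries) {
  std::vector<Gpu> gpus = Collect(drm_entries);
  if (gpus.empty()) gpus = Collect(pci_entries);
  return gpus;
}
```

Modeled on the EC2 case above, a DRM list holding only a vendor-less `card0` yields nothing, so the PCI entry with vendor `0x10de` is what gets returned.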