Fix DRM loop error propagation bypassing PCI fallback on vGPU instances by f-dy · Pull Request #27810 · microsoft/onnxruntime

f-dy · 2026-03-23T16:10:39Z

Description

On AWS EC2 vGPU instances (e.g. g5.xlarge with A10G), /sys/class/drm/card0 exists as a simple-framebuffer for console VGA but lacks device/vendor. GetGpuDeviceFromSysfs fails, and ORT_RETURN_IF_ERROR propagates the error immediately, preventing the PCI fallback (added in #27591) from ever running.

The PCI path has the correct data:

/sys/bus/pci/devices/0000:00:1e.0/vendor = 0x10de  (NVIDIA)
/sys/bus/pci/devices/0000:00:1e.0/class  = 0x030200 (3D controller)
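As a rough illustration of how the two sysfs values above can be interpreted, the following sketch parses the hex text and checks the vendor ID and PCI class code. The helper names (`ParseSysfsHex`, `IsDisplayClass`, `Is3dController`) are hypothetical and are not taken from ONNX Runtime; only the values 0x10de and 0x030200 come from the description.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helpers for illustration only; not part of ONNX Runtime.

// NVIDIA's PCI vendor ID, as quoted in the description above.
constexpr std::uint32_t kNvidiaVendorId = 0x10de;

// Parses sysfs hex text such as "0x10de\n". std::stoul with base 16 accepts
// the optional "0x" prefix and stops at the trailing newline.
std::uint32_t ParseSysfsHex(const std::string& text) {
  return static_cast<std::uint32_t>(std::stoul(text, nullptr, 16));
}

// The sysfs "class" value is 24 bits: base class, subclass, programming interface.
// Base class 0x03 is "display controller".
bool IsDisplayClass(std::uint32_t class_code) {
  return (class_code >> 16) == 0x03;
}

// Base class 0x03 with subclass 0x02 is "3D controller".
bool Is3dController(std::uint32_t class_code) {
  return (class_code >> 8) == 0x0302;
}
```

With the values from this instance, `ParseSysfsHex("0x10de\n")` equals `kNvidiaVendorId`, and `Is3dController(ParseSysfsHex("0x030200"))` is true.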

Change

Change the DRM loop in GetGpuDevices() to log a warning and continue instead of ORT_RETURN_IF_ERROR, as suggested by @tianleiwu during the #27591 review. This allows the loop to skip non-GPU DRM entries (like simple-framebuffer) and fall through to the PCI fallback when no valid GPU is found via DRM.

Before:

ORT_RETURN_IF_ERROR(GetGpuDeviceFromSysfs(gpu_sysfs_path_info, gpu_device));

After:

auto drm_status = GetGpuDeviceFromSysfs(gpu_sysfs_path_info, gpu_device);
if (!drm_status.IsOK()) {
  LOGS_DEFAULT(WARNING) << "Skipping DRM device at " << gpu_sysfs_path_info.path << ": " << drm_status.ErrorMessage();
  continue;
}
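The control flow this change aims for can be sketched in isolation. The snippet below is a minimal model, assuming the PCI fallback runs only when the DRM scan yields no devices; `Device`, `Probe`, and `DiscoverGpus` are hypothetical stand-ins and not the ONNX Runtime types or functions.

```cpp
#include <functional>
#include <iostream>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration; not the ONNX Runtime types.
struct Device {
  std::string source;  // where the device was found, e.g. "drm" or "pci"
};
using Probe = std::function<std::optional<Device>(const std::string&)>;
using Fallback = std::function<std::vector<Device>()>;

std::vector<Device> DiscoverGpus(const std::vector<std::string>& drm_paths,
                                 const Probe& probe_drm,
                                 const Fallback& pci_fallback) {
  std::vector<Device> devices;
  for (const auto& path : drm_paths) {
    std::optional<Device> device = probe_drm(path);
    if (!device) {
      // Log and skip the entry instead of aborting the whole discovery.
      std::cerr << "Skipping DRM device at " << path << '\n';
      continue;
    }
    devices.push_back(*device);
  }
  // Assumption: the fallback is consulted only when DRM found nothing.
  if (devices.empty()) {
    return pci_fallback();
  }
  return devices;
}
```

In this model, a single unreadable entry such as a simple-framebuffer `card0` no longer prevents the fallback from being reached.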

Testing

Verified on AWS EC2 g5.xlarge (NVIDIA A10G, Ubuntu 24.04):

Before: GetEpDevices() returns only CPUExecutionProvider; warning: GPU device discovery failed: Failed to open file: "/sys/class/drm/card0/device/vendor"
After: GetEpDevices() returns both CPUExecutionProvider and CUDAExecutionProvider; warning: Skipping DRM device at "/sys/class/drm/card0": Failed to open file: "/sys/class/drm/card0/device/vendor" (PCI fallback finds GPU correctly)

Motivation and Context

Without this fix, the plugin EP architecture (RegisterExecutionProviderLibrary, CopyTensors, shared allocators) is blocked on all AWS EC2 GPU instances where the vGPU driver does not expose nvidia-drm vendor metadata via sysfs.

@tianleiwu

On AWS EC2 vGPU instances (e.g. g5.xlarge with A10G), /sys/class/drm/card0 exists (simple-framebuffer for console VGA) but lacks device/vendor. GetGpuDeviceFromSysfs fails, and ORT_RETURN_IF_ERROR propagates the error immediately, preventing the PCI fallback at line 289 from ever running.

The PCI path has the correct data:

/sys/bus/pci/devices/0000:00:1e.0/vendor = 0x10de (NVIDIA)
/sys/bus/pci/devices/0000:00:1e.0/class = 0x030200 (3D controller)

Change the DRM loop to log a warning and continue instead of returning error, as suggested by @tianleiwu during the microsoft#27591 review. This allows the loop to skip non-GPU DRM entries (like simple-framebuffer) and fall through to the PCI fallback when no valid GPU is found via DRM.

Fixes microsoft#27806

edgchen1 · 2026-03-24T19:08:07Z

onnxruntime/core/platform/linux/device_discovery.cc

+    auto drm_status = GetGpuDeviceFromSysfs(gpu_sysfs_path_info, gpu_device);
+    if (!drm_status.IsOK()) {
+      LOGS_DEFAULT(WARNING) << "Skipping DRM device at " << gpu_sysfs_path_info.path << ": " << drm_status.ErrorMessage();
+      continue;


If it fails partway through the loop, after at least one device has already been found (gpu_devices.size() > 0), then it also won't reach the PCI fallback logic. Is that fine?
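The scenario raised here can be modeled with a small sketch. It assumes, as the pull request overview states, that PCI scanning is used only when DRM-based discovery yields zero valid GPUs; `PciFallbackRuns` is a hypothetical helper, not a function in device_discovery.cc.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical model for illustration. Each element is true if the DRM probe
// for that card entry succeeded, false if it failed.
bool PciFallbackRuns(const std::vector<bool>& drm_probe_ok) {
  std::size_t found = 0;
  for (bool ok : drm_probe_ok) {
    if (!ok) {
      continue;  // with this change, a failed entry is skipped with a warning
    }
    ++found;
  }
  // Assumption: the fallback is gated on the DRM scan finding nothing.
  return found == 0;
}
```

Under this model, `PciFallbackRuns({false})` is true (the g5.xlarge case), while `PciFallbackRuns({true, false})` is false: a mixed result keeps the devices already found and never consults the PCI path.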

edgchen1 · 2026-03-24T19:08:42Z

onnxruntime/core/platform/linux/device_discovery.cc


 Status GetGpuDevices(std::vector<OrtHardwareDevice>& gpu_devices_out) {
  std::vector<GpuSysfsPathInfo> gpu_sysfs_path_infos{};
  ORT_RETURN_IF_ERROR(DetectGpuSysfsPaths(gpu_sysfs_path_infos));


Would an early return from DetectGpuSysfsPaths() also cause issues?

Copilot

Pull request overview

This PR fixes Linux GPU device discovery on AWS EC2 vGPU instances where /sys/class/drm/cardN entries exist but are missing required metadata (e.g., device/vendor), by preventing a single DRM parsing failure from aborting discovery and thereby allowing the existing PCI fallback path to run.

Changes:

Update the DRM sysfs scan loop in GetGpuDevices() to log a warning and skip invalid DRM entries instead of returning early on error.
Preserve the existing behavior where PCI bus scanning is used when DRM-based discovery yields zero valid GPUs.


edgchen1 requested a review from Copilot March 24, 2026 19:09

Copilot started reviewing on behalf of edgchen1 March 24, 2026 19:11

f-dy wants to merge 1 commit into microsoft:main from f-dy:fix/drm-loop-graceful-degradation
