Fix DRM loop error propagation bypassing PCI fallback on vGPU instances #27810

Open
f-dy wants to merge 1 commit into microsoft:main from f-dy:fix/drm-loop-graceful-degradation

f-dy commented Mar 23, 2026

Description

Fixes #27806.

On AWS EC2 vGPU instances (e.g. g5.xlarge with A10G), /sys/class/drm/card0 exists as a simple-framebuffer for console VGA but lacks device/vendor. GetGpuDeviceFromSysfs fails, and ORT_RETURN_IF_ERROR propagates the error immediately, preventing the PCI fallback (added in #27591) from ever running.

The PCI path has the correct data:

/sys/bus/pci/devices/0000:00:1e.0/vendor = 0x10de  (NVIDIA)
/sys/bus/pci/devices/0000:00:1e.0/class  = 0x030200 (3D controller)
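
To make the two values above concrete, here is a minimal sketch of how such sysfs strings could be interpreted. The helper names `ParseSysfsHex` and `IsDisplayClass` are hypothetical and are not taken from ONNX Runtime. The sketch relies only on the PCI convention that the top byte of the 24-bit class code is the base class, where 0x03 means display controller (sub-class 0x02 being a 3D controller).

```cpp
#include <cstddef>
#include <cstdint>
#include <exception>
#include <optional>
#include <string>

// Hypothetical helper: parse a sysfs hex value such as "0x10de\n".
// Returns std::nullopt when the text does not start with a hex number.
std::optional<std::uint32_t> ParseSysfsHex(const std::string& text) {
  try {
    std::size_t consumed = 0;
    // Base 16 accepts an optional "0x" prefix and stops at the trailing newline.
    unsigned long value = std::stoul(text, &consumed, 16);
    if (consumed == 0) return std::nullopt;
    return static_cast<std::uint32_t>(value);
  } catch (const std::exception&) {
    return std::nullopt;  // empty or non-numeric content
  }
}

// Hypothetical helper: true when the PCI base class (bits 23..16) is 0x03,
// i.e. a display controller such as VGA (0x0300) or 3D controller (0x0302).
bool IsDisplayClass(std::uint32_t class_code) {
  return ((class_code >> 16) & 0xFF) == 0x03;
}
```

With the values quoted above, the vendor string parses to 0x10de and the class value 0x030200 is accepted as a display-class device, which is why the PCI path can identify the GPU even when the DRM entry cannot.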

Change

Change the DRM loop in GetGpuDevices() to log a warning and continue instead of ORT_RETURN_IF_ERROR, as suggested by @tianleiwu during the #27591 review. This allows the loop to skip non-GPU DRM entries (like simple-framebuffer) and fall through to the PCI fallback when no valid GPU is found via DRM.

Before:

ORT_RETURN_IF_ERROR(GetGpuDeviceFromSysfs(gpu_sysfs_path_info, gpu_device));

After:

auto drm_status = GetGpuDeviceFromSysfs(gpu_sysfs_path_info, gpu_device);
if (!drm_status.IsOK()) {
  LOGS_DEFAULT(WARNING) << "Skipping DRM device at " << gpu_sysfs_path_info.path << ": " << drm_status.ErrorMessage();
  continue;
}
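
The control flow this change aims for can be sketched in isolation. Everything below is a simplified stand-in written for illustration: `SketchStatus`, `SketchDevice`, `DiscoveryResult`, and `DiscoverGpus` are hypothetical names, warnings are collected in a vector instead of going through `LOGS_DEFAULT`, and the fallback condition (no valid GPU found via DRM) is taken from the description above rather than from the actual ONNX Runtime source.

```cpp
#include <functional>
#include <string>
#include <vector>

// Minimal stand-in for onnxruntime's Status (assumption for this sketch only).
struct SketchStatus {
  bool ok;
  std::string message;
  bool IsOK() const { return ok; }
};

// Stand-in for a discovered hardware device.
struct SketchDevice {
  std::string source;
};

struct DiscoveryResult {
  std::vector<SketchDevice> devices;
  std::vector<std::string> warnings;
  bool used_pci_fallback = false;
};

using DrmProbe = std::function<SketchStatus(const std::string&, SketchDevice&)>;
using PciScan = std::function<std::vector<SketchDevice>()>;

// Probe every DRM path; skip entries that fail instead of returning the error.
DiscoveryResult DiscoverGpus(const std::vector<std::string>& drm_paths,
                             const DrmProbe& probe_drm,
                             const PciScan& scan_pci) {
  DiscoveryResult result;
  for (const auto& path : drm_paths) {
    SketchDevice device;
    SketchStatus status = probe_drm(path, device);
    if (!status.IsOK()) {
      result.warnings.push_back("Skipping DRM device at " + path + ": " + status.message);
      continue;  // keep scanning; one bad entry no longer aborts discovery
    }
    result.devices.push_back(device);
  }
  // Fall back to the PCI scan only when DRM yielded no valid GPU.
  if (result.devices.empty()) {
    result.used_pci_fallback = true;
    result.devices = scan_pci();
  }
  return result;
}
```

In this model, a single `card0` entry whose probe fails produces one warning and an empty DRM result, so the PCI scan runs and supplies the device.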

Testing

Verified on AWS EC2 g5.xlarge (NVIDIA A10G, Ubuntu 24.04):

  • Before: GetEpDevices() returns only CPUExecutionProvider; warning: GPU device discovery failed: Failed to open file: "/sys/class/drm/card0/device/vendor"
  • After: GetEpDevices() returns both CPUExecutionProvider and CUDAExecutionProvider; warning: Skipping DRM device at "/sys/class/drm/card0": Failed to open file: "/sys/class/drm/card0/device/vendor" (PCI fallback finds GPU correctly)

Motivation and Context

This blocks the plugin EP architecture (RegisterExecutionProviderLibrary, CopyTensors, shared allocators) on all AWS EC2 GPU instances where the vGPU driver does not expose nvidia-drm vendor metadata via sysfs.

auto drm_status = GetGpuDeviceFromSysfs(gpu_sysfs_path_info, gpu_device);
if (!drm_status.IsOK()) {
  LOGS_DEFAULT(WARNING) << "Skipping DRM device at " << gpu_sysfs_path_info.path << ": " << drm_status.ErrorMessage();
  continue;
Contributor

If it fails partway through the loop (when gpu_devices.size() > 0), it also won't reach the PCI fallback logic. Is that fine?
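
The scenario raised in this comment can be modeled with a small sketch. It assumes, as the PR description states, that the PCI fallback runs only when DRM discovery yields no valid GPU. `LoopOutcome` and `ModelDrmLoop` are hypothetical names used only for illustration and do not appear in ONNX Runtime.

```cpp
#include <cstddef>
#include <vector>

struct LoopOutcome {
  std::size_t valid = 0;          // DRM entries that produced a device
  std::size_t skipped = 0;        // DRM entries skipped with a warning
  bool pci_fallback_runs = false; // whether the PCI scan would be reached
};

// Each element models one DRM entry: true if its probe succeeded.
LoopOutcome ModelDrmLoop(const std::vector<bool>& probe_succeeded) {
  LoopOutcome outcome;
  for (bool ok : probe_succeeded) {
    if (ok) {
      ++outcome.valid;
    } else {
      ++outcome.skipped;
    }
  }
  // Assumed condition: fallback only when no valid GPU was found via DRM.
  outcome.pci_fallback_runs = (outcome.valid == 0);
  return outcome;
}
```

Under this assumption, a mixed result such as one successful entry followed by one failed entry leaves the fallback unused, so a GPU visible only through PCI would not be reported in that case.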


Status GetGpuDevices(std::vector<OrtHardwareDevice>& gpu_devices_out) {
  std::vector<GpuSysfsPathInfo> gpu_sysfs_path_infos{};
  ORT_RETURN_IF_ERROR(DetectGpuSysfsPaths(gpu_sysfs_path_infos));
Contributor

Would an early return from DetectGpuSysfsPaths() also cause issues?


Copilot AI left a comment

Pull request overview

This PR fixes Linux GPU device discovery on AWS EC2 vGPU instances where /sys/class/drm/cardN entries exist but are missing required metadata (e.g., device/vendor), by preventing a single DRM parsing failure from aborting discovery and thereby allowing the existing PCI fallback path to run.

Changes:

  • Update the DRM sysfs scan loop in GetGpuDevices() to log a warning and skip invalid DRM entries instead of returning early on error.
  • Preserve the existing behavior where PCI bus scanning is used when DRM-based discovery yields zero valid GPUs.

Development

Successfully merging this pull request may close these issues.

PCI fallback unreachable on AWS EC2 vGPU — DRM loop error propagation bypasses fallback (#27591)