[CORE]: Improve filesystem error messages during Linux device discovery #27289

Open
theHamsta wants to merge 2 commits into microsoft:main from theHamsta:sseitz/better-error-message-ErrorCodeToStatus

@theHamsta
Contributor

Description

This is a follow-up to #26210
to address #26210 (comment) and review dog lints.

ErrorCodeToStatus currently does not include the filesystem path that
caused the error. This can make it difficult to find the root cause
of a reported filesystem error.

Review dog lints: https://github.com/microsoft/onnxruntime/pull/26210/changes
Plus a typo: dit -> did

Motivation and Context

Clean up discussed issues and lints of #26210

namespace {

- Status ErrorCodeToStatus(const std::error_code& ec) {
+ Status ErrorCodeToStatus(const std::error_code& ec, const fs::path& path, std::string_view context) {
Contributor Author

@theHamsta theHamsta Feb 9, 2026

The "context" string could be overkill. I thought it would provide useful context on why we accessed the file. But I would at least include the filesystem path that caused the error code. Otherwise, debugging the error could be difficult.
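
A minimal sketch of how such a conversion could carry the path and context, for illustration only. `SketchStatus` and `ErrorCodeToSketchStatus` are hypothetical stand-ins for ONNX Runtime's `Status` and the function under review, and plain stream formatting is assumed in place of `MakeString`:

```cpp
#include <cassert>
#include <filesystem>
#include <sstream>
#include <string>
#include <string_view>
#include <system_error>

namespace fs = std::filesystem;

// Hypothetical stand-in for onnxruntime's Status; only what this sketch needs.
struct SketchStatus {
  bool ok = true;
  std::string message;
};

// Converts an error_code into a status that names the failing path and
// the reason the path was accessed.
SketchStatus ErrorCodeToSketchStatus(const std::error_code& ec, const fs::path& path,
                                     std::string_view context) {
  if (!ec) {
    return {};  // success: nothing to report
  }
  std::ostringstream oss;
  oss << "Error: std::error_code with category name: " << ec.category().name()
      << ", value: " << ec.value() << ", message: " << ec.message()
      << ", filesystem path: " << path  // operator<< writes the path in quotes
      << ", context: " << context;
  return {false, oss.str()};
}
```

With this shape, a caller that hits a permission error on a sysfs file gets a message that names that file, while a default-constructed `std::error_code` still maps to success.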

@theHamsta changed the title from "Sseitz/better error message error code to status" to "[CORE]: Address review dog lints in microsoft#26210 (Linux device discovery)" Feb 9, 2026
@theHamsta changed the title from "[CORE]: Address review dog lints in microsoft#26210 (Linux device discovery)" to "[CORE]: Improve filesystem error messages during Linux device discovery" Feb 9, 2026
@theHamsta force-pushed the sseitz/better-error-message-ErrorCodeToStatus branch from da15088 to e414ecd March 25, 2026 11:52
@theHamsta
Contributor Author

@edgchen1 On one of our test systems, device discovery fails for the iGPU (there are permission problems for this one file). As a result, even card1, which would be the Nvidia GPU we actually want to use, does not get discovered.

Could we change failures for individual sysfs paths to warnings while still allowing the discovery of other devices?

2026-04-07 21:25:38.450380746 [W:onnxruntime:Default, device_discovery.cc:283 GetGpuDevices] Failed to detect devices under "/sys/class/drm/card0": device_discovery.cc:93 ReadFileContents Failed to open file: "/sys/class/drm/card0/device/vendor"

Before this change, you would just get a failure without even knowing that a file path for the iGPU is at fault.
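
The warn-and-continue behaviour asked for here could look roughly like the sketch below. All names (`ProbeResult`, `DiscoveryOutcome`, `DiscoverDevices`) are hypothetical, the probe is injected as a callback so the example does not depend on sysfs, and warnings are collected in a vector where the real code would presumably log them:

```cpp
#include <cassert>
#include <functional>
#include <optional>
#include <string>
#include <vector>

// Hypothetical result of probing one device directory.
struct ProbeResult {
  std::optional<std::string> device;  // set on success
  std::string error;                  // set on failure
};

// Devices that were found, plus one warning per device that could not be probed.
struct DiscoveryOutcome {
  std::vector<std::string> devices;
  std::vector<std::string> warnings;
};

// Probes every card path; a failure on one path is recorded and skipped
// instead of aborting discovery of the remaining paths.
DiscoveryOutcome DiscoverDevices(const std::vector<std::string>& card_paths,
                                 const std::function<ProbeResult(const std::string&)>& probe) {
  DiscoveryOutcome outcome;
  for (const auto& card_path : card_paths) {
    ProbeResult result = probe(card_path);
    if (!result.device) {
      outcome.warnings.push_back("Failed to detect devices under \"" + card_path +
                                 "\": " + result.error);
      continue;  // keep going: other cards may still be readable
    }
    outcome.devices.push_back(*result.device);
  }
  return outcome;
}
```

In the scenario described above, a probe that fails for card0 would leave one warning in the outcome and card1 would still be reported as a device.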

@theHamsta force-pushed the sseitz/better-error-message-ErrorCodeToStatus branch from e414ecd to 968349d April 8, 2026 08:51
@gedoensmax
Contributor

@chilo-ms can you trigger the CI and see if we can get this merged?

@chilo-ms
Contributor

chilo-ms commented Apr 8, 2026

/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline, Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

Contributor

Copilot AI left a comment

Pull request overview

Improves Linux GPU device discovery diagnostics by enriching filesystem-related Status errors with the failing path and additional context, and by making per-device discovery failures non-fatal (logged and skipped) to allow discovery to proceed.

Changes:

  • Extend ErrorCodeToStatus to include filesystem path + a human-readable context string in error messages.
  • Use the richer error construction at key filesystem calls (exists, directory_iterator, canonical).
  • Change GPU enumeration to warn-and-continue on per-device failures (both DRM-sysfs and PCI fallback paths).
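
To illustrate the second bullet, here is a hedged sketch of one such call site using the non-throwing `std::error_code` overload of `std::filesystem::canonical`. The helper name `CanonicalNoThrow` and the message layout are assumptions for this example and are not taken from the PR:

```cpp
#include <cassert>
#include <filesystem>
#include <optional>
#include <string>
#include <string_view>
#include <system_error>

namespace fs = std::filesystem;

// Resolves `path` without throwing. On failure, returns nullopt and fills
// `error` with the system message, the offending path, and the caller's context.
std::optional<fs::path> CanonicalNoThrow(const fs::path& path, std::string_view context,
                                         std::string& error) {
  std::error_code ec;
  fs::path resolved = fs::canonical(path, ec);  // error_code overload reports via ec
  if (ec) {
    error = ec.message() + ", filesystem path: " + path.string() +
            ", context: " + std::string(context);
    return std::nullopt;
  }
  return resolved;
}
```

The same pattern applies to `fs::exists(path, ec)` and `fs::directory_iterator(path, ec)`: check `ec` right after the call, while the path that was passed in is still at hand.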

Comment on lines +26 to +33
  Status ErrorCodeToStatus(const std::error_code& ec, const std::filesystem::path& path, const std::string& context) {
    if (!ec) {
      return Status::OK();
    }

    return Status{common::StatusCategory::ONNXRUNTIME, common::StatusCode::FAIL,
                  MakeString("Error: std::error_code with category name: ", ec.category().name(),
-                            ", value: ", ec.value(), ", message: ", ec.message())};
+                            ", value: ", ec.value(), ", message: ", ec.message(), ", filesystem path: ", path, ", context: ", context)};
Copilot AI Apr 9, 2026

ErrorCodeToStatus takes const std::string& context, but all call sites pass string literals, which will construct temporary std::strings (heap allocation) just to format the message. Consider changing this parameter to std::string_view or const char* to avoid the extra allocation while keeping the same behavior in MakeString.

Contributor Author

I previously had doubts about using string_view because it is not used much in the code base, but I have now changed the parameter to string_view.
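
For reference, a small sketch of the difference under discussion. `WithContext` is a hypothetical helper, not the PR's code. A `std::string_view` parameter views a string literal or an existing `std::string` in place, whereas a `const std::string&` parameter would first materialize a temporary `std::string` from a literal:

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Takes both pieces by std::string_view, so string literals and std::string
// arguments are viewed in place; only the returned message owns memory.
std::string WithContext(std::string_view message, std::string_view context) {
  std::string out;
  out.reserve(message.size() + context.size() + 11);  // 11 == length of ", context: "
  out.append(message);
  out.append(", context: ");
  out.append(context);
  return out;
}
```

One caveat of this choice: a `std::string_view` is not guaranteed to be null-terminated and must not outlive the string it refers to, so it suits parameters that are consumed immediately, as in message formatting.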

@edgchen1
Contributor

edgchen1 commented Apr 9, 2026

> Could we change failures for individual sysfs paths to warnings while still allowing the discovery of other devices?

sure, I think that's fine.

@theHamsta force-pushed the sseitz/better-error-message-ErrorCodeToStatus branch from a18ffce to baf91f0 April 10, 2026 08:00
@theHamsta force-pushed the sseitz/better-error-message-ErrorCodeToStatus branch from baf91f0 to 80862bd April 10, 2026 08:01
5 participants