[CORE]: Improve filesystem error messages during Linux device discovery#27289
[CORE]: Improve filesystem error messages during Linux device discovery#27289theHamsta wants to merge 2 commits intomicrosoft:mainfrom
Conversation
| namespace { | ||
|
|
||
| Status ErrorCodeToStatus(const std::error_code& ec) { | ||
| Status ErrorCodeToStatus(const std::error_code& ec, const fs::path& path, std::string_view context) { |
There was a problem hiding this comment.
the string "context" could be overkill. I thought it would provide useful context, why we accessed the file. But I would at least include the filesystem path that caused the error code. Otherwise, debugging the error code could be difficult
da15088 to
e414ecd
Compare
|
@edgchen1 on one of our test systems the device discovery fails for the iGPU (there are permission problems for this one file). This has the effect that even card1 which would be the Nvidia GPU we actually want to use does not get discovered. Could we change failures for individual sysfs paths to warnings while still allowing the discovery of other devices? Before this change you would just get a failure without even knowing that a file path for the iGPU is at fault |
e414ecd to
968349d
Compare
|
@chilo-ms can you trigger the CI and see if we can get this merged ? |
|
/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline, Windows GPU Doc Gen CI Pipeline |
|
Azure Pipelines successfully started running 4 pipeline(s). |
There was a problem hiding this comment.
Pull request overview
Improves Linux GPU device discovery diagnostics by enriching filesystem-related Status errors with the failing path and additional context, and by making per-device discovery failures non-fatal (logged and skipped) to allow discovery to proceed.
Changes:
- Extend
ErrorCodeToStatusto include filesystem path + a human-readable context string in error messages. - Use the richer error construction at key filesystem calls (
exists,directory_iterator,canonical). - Change GPU enumeration to warn-and-continue on per-device failures (both DRM-sysfs and PCI fallback paths).
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| Status ErrorCodeToStatus(const std::error_code& ec, const std::filesystem::path& path, const std::string& context) { | ||
| if (!ec) { | ||
| return Status::OK(); | ||
| } | ||
|
|
||
| return Status{common::StatusCategory::ONNXRUNTIME, common::StatusCode::FAIL, | ||
| MakeString("Error: std::error_code with category name: ", ec.category().name(), | ||
| ", value: ", ec.value(), ", message: ", ec.message())}; | ||
| ", value: ", ec.value(), ", message: ", ec.message(), ", filesystem path: ", path, ", context: ", context)}; |
There was a problem hiding this comment.
ErrorCodeToStatus takes const std::string& context, but all call sites pass string literals, which will construct temporary std::strings (heap allocation) just to format the message. Consider changing this parameter to std::string_view or const char* to avoid the extra allocation while keeping the same behavior in MakeString.
There was a problem hiding this comment.
I had doubts with usage of string_view before because not much used in the code base but now changed it to string_view
sure, I think that's fine. |
…covery) Review dog lints: https://github.com/microsoft/onnxruntime/pull/26210/changes Plus a typo: `dit` -> `did`
a18ffce to
baf91f0
Compare
ErrorCodeToStatus currently does not include the filesystem path that caused the error. This could it make difficult to know the root cause of a reported filesystem error.
baf91f0 to
80862bd
Compare
Description
This is a follow-up to #26210
to address #26210 (comment) and review dog lints.
ErrorCodeToStatus currently does not include the filesystem path that
caused the error. This could it make difficult to know the root cause
of a reported filesystem error.
Review dog lints: https://github.com/microsoft/onnxruntime/pull/26210/changes
Plus a typo:
dit->didMotivation and Context
Clean up discussed issues and lints of #26210