[Core] Fix ray::Status <--> gRPC status interplay. #14278
Labels: core, enhancement, P2, RFC
There isn't a clean `ray::Status` <--> gRPC status conversion for all `ray::Status`es, yet we're pretending that there is by recasting every server-side `ray::Status` as an `IOError` client-side. This is very confusing to Ray devs when they, say, return a `Status::ObjectNotFound` status from the server that is recast as a generic `Status::IOError` on the client. We should fix this conversion so application code can properly exchange and interpret application-level errors, while maintaining support for transport-level gRPC errors.

Options
Two options are immediately apparent:

1. Embed `ray::Status` in our proto payloads. Transport-level errors would still be handled via the gRPC status, but application-level errors that don't map to gRPC status codes would be defined in the reply proto, alongside the normal payload. How aggressively we should try to map a subset of `ray::Status`es to gRPC statuses requires some thought, e.g. should an "object not found" error be an application-level `Status::ObjectNotFound` error, or should it be mapped to the `NOT_FOUND` gRPC status code at the transport level? The best-practice consensus is to avoid defining "specific resource has X state" codes when a generic "resource has X state" code exists, i.e. we should use the `NOT_FOUND` gRPC status code where possible.
2. Use gRPC's richer error model, in which `ray::Status`-esque errors would go into the error details. We would still do a best-effort mapping of `ray::Status` codes to gRPC status codes, falling back to gRPC status code `UNKNOWN` when a good mapping doesn't exist. Support for the richer error model exists for our core language (C++), our current frontend (worker) languages (Python, Java, C++), and potential future core languages (Go, Rust), but not yet for grpc-web or Node.js. I think that this support will suffice for our needs, especially given that the richer error model should be opt-in for each RPC.

I believe that option (2), the richer error model, is the best approach.
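To make the best-effort mapping with an `UNKNOWN` fallback concrete, here is a rough, self-contained sketch. Everything in it is hypothetical: `RayCode`, `RayStatus`, `WireStatus`, `ToGrpcCode`, `ToWire`, and `FromWire` are stand-ins invented for illustration, not Ray's or gRPC's real types. The `detail` field only approximates the error-details payload, which in a real implementation would presumably be a serialized message attached to the gRPC status. The gRPC code names and numeric values are the canonical ones.

```cpp
#include <string>

// Hypothetical, trimmed-down stand-in for ray::Status codes (not Ray's real enum).
enum class RayCode { OK, IOError, ObjectNotFound, InvalidArgument, ObjectStoreFull };

// Subset of the canonical gRPC status codes, with their standard numeric values.
enum class GrpcCode {
  OK = 0,
  UNKNOWN = 2,
  INVALID_ARGUMENT = 3,
  NOT_FOUND = 5,
  UNAVAILABLE = 14
};

// Hypothetical application-level status.
struct RayStatus {
  RayCode code;
  std::string message;
};

// Hypothetical wire form: a gRPC code and message, plus the original
// application code carried as a stand-in for the error-details payload.
struct WireStatus {
  GrpcCode grpc_code;
  std::string message;
  bool has_detail;  // false for pure transport-level errors
  RayCode detail;   // only meaningful when has_detail is true
};

// Best-effort mapping; anything without a good gRPC equivalent becomes UNKNOWN.
GrpcCode ToGrpcCode(RayCode code) {
  switch (code) {
    case RayCode::OK:
      return GrpcCode::OK;
    case RayCode::ObjectNotFound:
      return GrpcCode::NOT_FOUND;
    case RayCode::InvalidArgument:
      return GrpcCode::INVALID_ARGUMENT;
    default:
      return GrpcCode::UNKNOWN;
  }
}

// Server side: attach the original code as a detail for every non-OK status.
WireStatus ToWire(const RayStatus &status) {
  WireStatus wire{ToGrpcCode(status.code), status.message, false, RayCode::OK};
  if (status.code != RayCode::OK) {
    wire.has_detail = true;
    wire.detail = status.code;
  }
  return wire;
}

// Client side: prefer the detail; only detail-less failures (i.e. transport
// errors in this sketch) are treated as a generic IOError.
RayStatus FromWire(const WireStatus &wire) {
  if (wire.grpc_code == GrpcCode::OK) {
    return {RayCode::OK, ""};
  }
  if (wire.has_detail) {
    return {wire.detail, wire.message};
  }
  return {RayCode::IOError, wire.message};
}
```

Under these assumptions, an `ObjectStoreFull` status travels as gRPC `UNKNOWN` but is still recovered as `ObjectStoreFull` on the client, while a bare `UNAVAILABLE` with no detail surfaces as `IOError`. Clients that ignore the details still see a sensible generic gRPC code, which fits the idea of keeping the richer model opt-in per RPC.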