-
Notifications
You must be signed in to change notification settings - Fork 4k
RFC: Add inlined data to flight. #12571
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Thanks for opening a pull request! If this is not a minor PR. Could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW Opening JIRAs ahead of time contributes to the Openness of the Apache Arrow project. Then could you also rename pull request title in the following format? or See also: |
update tag refs.
format/Flight.proto
Outdated
| // The data present for inlined_data represents only a partial | ||
| // sample of the data available for the ticket. No guarantees on the | ||
| // ordering are provided. | ||
| SAMPLE_DATA = 2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, so the semantics for this would be perhaps a .head() or LIMIT <n> so a client can preview some arbitrary amount of data and call DoGet for the entire dataset?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah i think this is what @BryanCutler suggested on the mailing list thread.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would be a usecase for SAMPLE_DATA?
I'm thinking that something like LIMIT would probably come as part of the query, so for IE LIMIT 10 it would be COMPLETE_DATA but the limit is implemented data-side
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the use-case on the mailing list was data preview. @BryanCutler might have something more specific. But one can imagine wanting to show a few rows of data in a UI for users with lowest possible latency while all data is fetched.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like this type of behavior could also be encoded using application specific metadata (rather than explicitly in the Flight protocol itself)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FlightInfo lacks application specific metadata, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like this type of behavior could also be encoded using application specific metadata (rather than explicitly in the Flight protocol itself)
Lets try to consolidate this thread with the one above?
format/Flight.proto
Outdated
| // limitations. Furthermore, since the data is expected to be small, implementations | ||
| // are less likely to optimize for zero-copy in these cases. | ||
| repeated FlightData inlined_data = 2; | ||
| enum InlinedCompleteness { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think an enum is reasonable if we want to extend things
|
This would be helpful for my usecase, I expect most reponses to be sub-1MB and found the two-part exchange for Flights (one to get a |
| // inlining data is appropriate. | ||
| // | ||
| // The size of inlined_data is expected to be small (typically less then 1MB) | ||
| // and inlining too much data across tickets can run into underlying transport |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any hard-set number to be aware of here? Or some general guidance/recommendations?
Seems important not to accidentally hit payload size limit in the transport
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We might already be overriding but gRPC has a default of 4MB limit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm hesitant to document this since there are experiments with other transports.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah that is good to know, thank you. Will have to see how that can be configured because there are some cases in which a query may fetch ~10MB of data (or more)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's up to the transport. I believe we override this by default in C++ and Java, yeah, but I don't think we should document it here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this proposal. Being able to avoid the double network request I think would definitely help us in the arrow-rs implementation (e.g. apache/arrow-rs#1386)
cc @wangfenjin @tustvold and @viirya
format/Flight.proto
Outdated
| // | ||
| // The inlined data is not expected to contain schema metadata. The schema | ||
| // should be identical to the schema provided on FlightInfo. | ||
| repeated FlightData inlined_data = 2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am +1 to adding an optional FlightData field here. It would be great to avoid having to make a subsequent DoGet request
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, looks like a good addition.
format/Flight.proto
Outdated
| // The inlined data is not expected to contain schema metadata. The schema | ||
| // should be identical to the schema provided on FlightInfo. | ||
| repeated FlightData inlined_data = 2; | ||
| enum InlinedCompleteness { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the Ticket is already an opaque data blob that is interpreted by the application, I am not sure how much extra value adding a generic inlined_data_completeness would be.
For applications that need to send NO_INLINED_DATA they simply omit inlined_data
For applications that need to send data, they could include inlined_data and could distinguish between "complete" and "sample data" somehow in the application defined Ticket payload anyways.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ticket is meant to be opaque to the client, so it may not be appropriate to use for this? (I've always thought of it as analogous to cookies or HTTP session state - the client carries it and passes it back to the server, but doesn't manipulate it.) That said I do agree that a general application metadata field somewhere in FlightInfo would likely be useful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the Ticket is already an opaque data blob that is interpreted by the application, I am not sure how much extra value adding a generic inlined_data_completeness would be.
For applications that need to send NO_INLINED_DATA they simply omit inlined_data
I'm going to simplify this to make a nested message for inlined_data which cleans this up a little bit.
For applications that need to send data, they could include inlined_data and could distinguish between "complete" and "sample data" somehow in the application defined Ticket payload anyways.
I agree with David on the opaqueness of Ticket which is my understanding also. I think the benefit of directly modeling complete vs incomplete, is at least for incomplete data, we can completely elide the details from consumers. Users could directly pass the ticket to DoGet and the framework can recognize there is inlined data present and its complete and just use that without existing clients making any code changes. I haven't started implementing this but that was the approach I was going to look at first to see if it is feasible.
For sampled data it is always going to require a client side change to make use of it unless we the framework provides stronger guarantees (i.e. data is always the same exact data as the first part of the stream).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am missing the idea that the client can be general and not bound to a specific server implementation. I probably don't understand how that would work in practice, because if for example the client needs to call DoGet to start the server computing something, the client needs to provide a Ticket it got somewhere 🤔
I see now that perhaps FlightDescriptor::cmd is perhaps the correct place
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think a client can easily be fully general, but the flow is meant to go GetFlightInfo(FlightDescriptor) -> FlightInfo, then DoGet(Ticket), where the Ticket comes from the FlightInfo.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DoGet to start the server computing something, the client needs to provide a Ticket it got somewhere 🤔
This is a good point also that there is an assumption for this that most of the work starts as part of GetFlightInfo or before. This type of thing would not directly apply if their computation didn't start until DoGet.
At least for SQL and other types of storage systems I think this assumption is reasonable but agree not universally applicable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I / we may have misunderstood the expected flow for using Flight -- hopefully as we migrate to using FlightSQL (as we plan to do for IOx) we'll end up conforming more closely to the intended use
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
format/Flight.proto
Outdated
| // The data present for inlined_data represents only a partial | ||
| // sample of the data available for the ticket. No guarantees on the | ||
| // ordering are provided. | ||
| SAMPLE_DATA = 2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like this type of behavior could also be encoded using application specific metadata (rather than explicitly in the Flight protocol itself)
format/Flight.proto
Outdated
| enum InlinedCompleteness { | ||
| // Default is no data is inlined. An UNDEFINED value is not provided because an enum | ||
| // for the data that the client isn't aware of makes the data unusable anyways. | ||
| NO_INLINED_DATA = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| NO_INLINED_DATA = 0; | |
| NO_DATA = 0; |
As this is already for inline data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this will be changed to undefined. I'm going to move the enum and the FlightData into a nested message
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another idea. If the inlined data is optional, do we still need NO_INLINED_DATA/undefined?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updating as we speak. Will push a new version in a few minutes hopefully. Convention for eums in protobufs is to make the first element undefined because it allows for clients to detech non-backward compatible changes by the server.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should be updated now.
format/Flight.proto
Outdated
| // | ||
| // The inlined data is not expected to contain schema metadata. The schema | ||
| // should be identical to the schema provided on FlightInfo. | ||
| repeated FlightData inlined_data = 2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, looks like a good addition.
format/Flight.proto
Outdated
| // | ||
| // `inlined_completeness` indicates what part of the data retriavable | ||
| // by the ticket this represents. This is provided as an optimization for | ||
| // client/server applications that want to reduce latency to first result |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does "reduce latency to first result" sound like the inlined data is always for first rows (.head())?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It could, open to suggestions on phrasing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may mention it as "to get partial result" or "to get preview result".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, tried rephrasing a little bit.
|
It's a good addition, but I'm not sure if it's a good idea to add this complexity into Flight protocol. The idea is we may want to preview sample data, or if the datasets is small we can get the data in one request. Do you think we can use DoAction() API to accomplish this goal? If yes, we can do that in FlightSQL protocol |
No, I don't think so. My understanding is DoAction implies the results will be smallish. One could potentially use DoAction to return a FlightInfo + optional inlined data. However, for queries FlightSQL mandates using GetFlightInfo, again we could change this to also pass it to also DoAction, but this seems to side step the intended use of GetFlightInfo(). This also runs into the same problem. If the data turns out to be large, then another RPC needed to made to start retrieving it (although the latency hit for larger data sets is probably generally less important). The crux of the problem is clients don't necessarily know the size of the results when calling GetFlightInfo. I don't think this is specific to SQL hence the proposal for putting it in Flight proper. One could argue that at some point we might want having a streaming RPC (a combination of DoAction and GetFlightInfo) that just returns tickets for absolute minimal latency but that is a much larger change and makes sense in less scenarios. If you where thinking something different here let me know.
IIUC this is about a better UX where you could show some data to the user immediately while the rest downloads in the background. This was incorporating a suggestion from the mailing list, my larger concern is complete results. |
Co-authored-by: Liang-Chi Hsieh <[email protected]>
Fixed in the description (added a zero instead of a close parenthesis) |
BryanCutler
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for this. Initially I was a little apprehensive of doing this in GetFlightInfo, but as I thought about it, it makes sense and would be a useful addition.
format/Flight.proto
Outdated
| // The data present in inlined_data represents only a partial | ||
| // sample of the data available for the ticket. No guarantees on the | ||
| // ordering are provided. | ||
| SAMPLE_DATA = 2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about calling this PARTIAL_DATA? The term sample might lead some to think it would be a random sample, so just calling it partial would be a little more generic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
better name. Changed.
format/Flight.proto
Outdated
| message InlinedData { | ||
| // Arrow data for consumption. | ||
| // | ||
| // The data is not expected to contain schema metadata. The schema |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess nit, but maybe 'not expected to contain the schema message'?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
format/Flight.proto
Outdated
| // | ||
| // The data is not expected to contain the schema message. The schema | ||
| // should be identical to the schema provided on FlightInfo. | ||
| repeated FlightData data = 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be reordered? Such that
inlined_completeness = 1
data = 2
That way if inlined_completeness == DATA_SUBSET_TYPE_UNDEFINED then there would be no need to read data.
This would then follow the similar pattern in FlightDescriptor where DescriptorType precedes the fields cmd and path whose contents or not is defined based on DescriptorType.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure.
|
Thank you for your contribution. Unfortunately, this pull request has been marked as stale because it has had no activity in the past 365 days. Please remove the stale label or comment below, or this PR will be closed in 14 days. Feel free to re-open this if it has been closed in error. If you do not have repository permissions to reopen the PR, please tag a maintainer. |
Trying to formalize discussion on from mailing list thread