Skip to content

Conversation

@emkornfield
Copy link
Contributor

@emkornfield emkornfield commented Mar 5, 2022

Trying to formalize discussion on from mailing list thread

@github-actions
Copy link

github-actions bot commented Mar 5, 2022

Thanks for opening a pull request!

If this is not a minor PR. Could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW

Opening JIRAs ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

See also:

update tag refs.
Comment on lines 325 to 328
// The data present for inlined_data represents only a partial
// sample of the data available for the ticket. No guarantees on the
// ordering are provided.
SAMPLE_DATA = 2;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, so the semantics for this would be perhaps a .head() or LIMIT <n> so a client can preview some arbitrary amount of data and call DoGet for the entire dataset?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah i think this is what @BryanCutler suggested on the mailing list thread.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would be a usecase for SAMPLE_DATA?

I'm thinking that something like LIMIT would probably come as part of the query, so for IE LIMIT 10 it would be COMPLETE_DATA but the limit is implemented data-side

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the use-case on the mailing list was data preview. @BryanCutler might have something more specific. But one can imagine wanting to show a few rows of data in a UI for users with lowest possible latency while all data is fetched.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like this type of behavior could also be encoded using application specific metadata (rather than explicitly in the Flight protocol itself)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FlightInfo lacks application specific metadata, though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like this type of behavior could also be encoded using application specific metadata (rather than explicitly in the Flight protocol itself)

Lets try to consolidate this thread with the one above?

// limitations. Furthermore, since the data is expected to be small, implementations
// are less likely to optimize for zero-copy in these cases.
repeated FlightData inlined_data = 2;
enum InlinedCompleteness {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think an enum is reasonable if we want to extend things

@GavinRay97
Copy link

This would be helpful for my usecase, I expect most reponses to be sub-1MB and found the two-part exchange for Flights (one to get a Ticket, another to exchange the Ticket for a stream of data) a bit heavy 👍

// inlining data is appropriate.
//
// The size of inlined_data is expected to be small (typically less then 1MB)
// and inlining too much data across tickets can run into underlying transport
Copy link

@GavinRay97 GavinRay97 Mar 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any hard-set number to be aware of here? Or some general guidance/recommendations?
Seems important not to accidentally hit payload size limit in the transport

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might already be overriding but gRPC has a default of 4MB limit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm hesitant to document this since there are experiments with other transports.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah that is good to know, thank you. Will have to see how that can be configured because there are some cases in which a query may fetch ~10MB of data (or more)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's up to the transport. I believe we override this by default in C++ and Java, yeah, but I don't think we should document it here.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this proposal. Being able to avoid the double network request I think would definitely help us in the arrow-rs implementation (e.g. apache/arrow-rs#1386)

cc @wangfenjin @tustvold and @viirya

//
// The inlined data is not expected to contain schema metadata. The schema
// should be identical to the schema provided on FlightInfo.
repeated FlightData inlined_data = 2;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am +1 to adding an optional FlightData field here. It would be great to avoid having to make a subsequent DoGet request

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, looks like a good addition.

// The inlined data is not expected to contain schema metadata. The schema
// should be identical to the schema provided on FlightInfo.
repeated FlightData inlined_data = 2;
enum InlinedCompleteness {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the Ticket is already an opaque data blob that is interpreted by the application, I am not sure how much extra value adding a generic inlined_data_completeness would be.

For applications that need to send NO_INLINED_DATA they simply omit inlined_data

For applications that need to send data, they could include inlined_data and could distinguish between "complete" and "sample data" somehow in the application defined Ticket payload anyways.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ticket is meant to be opaque to the client, so it may not be appropriate to use for this? (I've always thought of it as analogous to cookies or HTTP session state - the client carries it and passes it back to the server, but doesn't manipulate it.) That said I do agree that a general application metadata field somewhere in FlightInfo would likely be useful.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the Ticket is already an opaque data blob that is interpreted by the application, I am not sure how much extra value adding a generic inlined_data_completeness would be.

For applications that need to send NO_INLINED_DATA they simply omit inlined_data

I'm going to simplify this to make a nested message for inlined_data which cleans this up a little bit.

For applications that need to send data, they could include inlined_data and could distinguish between "complete" and "sample data" somehow in the application defined Ticket payload anyways.

I agree with David on the opaqueness of Ticket which is my understanding also. I think the benefit of directly modeling complete vs incomplete, is at least for incomplete data, we can completely elide the details from consumers. Users could directly pass the ticket to DoGet and the framework can recognize there is inlined data present and its complete and just use that without existing clients making any code changes. I haven't started implementing this but that was the approach I was going to look at first to see if it is feasible.

For sampled data it is always going to require a client side change to make use of it unless we the framework provides stronger guarantees (i.e. data is always the same exact data as the first part of the stream).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am missing the idea that the client can be general and not bound to a specific server implementation. I probably don't understand how that would work in practice, because if for example the client needs to call DoGet to start the server computing something, the client needs to provide a Ticket it got somewhere 🤔

I see now that perhaps FlightDescriptor::cmd is perhaps the correct place

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think a client can easily be fully general, but the flow is meant to go GetFlightInfo(FlightDescriptor) -> FlightInfo, then DoGet(Ticket), where the Ticket comes from the FlightInfo.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DoGet to start the server computing something, the client needs to provide a Ticket it got somewhere 🤔

This is a good point also that there is an assumption for this that most of the work starts as part of GetFlightInfo or before. This type of thing would not directly apply if their computation didn't start until DoGet.

At least for SQL and other types of storage systems I think this assumption is reasonable but agree not universally applicable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I / we may have misunderstood the expected flow for using Flight -- hopefully as we migrate to using FlightSQL (as we plan to do for IOx) we'll end up conforming more closely to the intended use

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You may be interested in #12636 where we try to clarify some of these things, also #12616 which tries to clarify it in the context of Flight SQL

Comment on lines 325 to 328
// The data present for inlined_data represents only a partial
// sample of the data available for the ticket. No guarantees on the
// ordering are provided.
SAMPLE_DATA = 2;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like this type of behavior could also be encoded using application specific metadata (rather than explicitly in the Flight protocol itself)

@wangfenjin
Copy link

enum InlinedCompleteness {
// Default is no data is inlined. An UNDEFINED value is not provided because an enum
// for the data that the client isn't aware of makes the data unusable anyways.
NO_INLINED_DATA = 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
NO_INLINED_DATA = 0;
NO_DATA = 0;

As this is already for inline data.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this will be changed to undefined. I'm going to move the enum and the FlightData into a nested message

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another idea. If the inlined data is optional, do we still need NO_INLINED_DATA/undefined?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updating as we speak. Will push a new version in a few minutes hopefully. Convention for eums in protobufs is to make the first element undefined because it allows for clients to detech non-backward compatible changes by the server.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be updated now.

//
// The inlined data is not expected to contain schema metadata. The schema
// should be identical to the schema provided on FlightInfo.
repeated FlightData inlined_data = 2;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, looks like a good addition.

//
// `inlined_completeness` indicates what part of the data retriavable
// by the ticket this represents. This is provided as an optimization for
// client/server applications that want to reduce latency to first result
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does "reduce latency to first result" sound like the inlined data is always for first rows (.head())?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could, open to suggestions on phrasing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may mention it as "to get partial result" or "to get preview result".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, tried rephrasing a little bit.

@wangfenjin
Copy link

It's a good addition, but I'm not sure if it's a good idea to add this complexity into Flight protocol.

The idea is we may want to preview sample data, or if the datasets is small we can get the data in one request. Do you think we can use DoAction() API to accomplish this goal? If yes, we can do that in FlightSQL protocol

@emkornfield
Copy link
Contributor Author

The idea is we may want to preview sample data, or if the datasets is small we can get the data in one request. Do you think we can use DoAction() API to accomplish this goal? If yes, we can do that in FlightSQL protocol

No, I don't think so. My understanding is DoAction implies the results will be smallish. One could potentially use DoAction to return a FlightInfo + optional inlined data. However, for queries FlightSQL mandates using GetFlightInfo, again we could change this to also pass it to also DoAction, but this seems to side step the intended use of GetFlightInfo(). This also runs into the same problem. If the data turns out to be large, then another RPC needed to made to start retrieving it (although the latency hit for larger data sets is probably generally less important).

The crux of the problem is clients don't necessarily know the size of the results when calling GetFlightInfo. I don't think this is specific to SQL hence the proposal for putting it in Flight proper.

One could argue that at some point we might want having a streaming RPC (a combination of DoAction and GetFlightInfo) that just returns tickets for absolute minimal latency but that is a much larger change and makes sense in less scenarios.

If you where thinking something different here let me know.

The idea is we may want to preview sample data,

IIUC this is about a better UX where you could show some data to the user immediately while the rest downloads in the background. This was incorporating a suggestion from the mailing list, my larger concern is complete results.

Co-authored-by: Liang-Chi Hsieh <[email protected]>
@emkornfield
Copy link
Contributor Author

https://lists.apache.org/thread/o1rq3zzxppnvlfwrs9kzf802go44068h0

404

Fixed in the description (added a zero instead of a close parenthesis)

Copy link
Member

@BryanCutler BryanCutler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for this. Initially I was a little apprehensive of doing this in GetFlightInfo, but as I thought about it, it makes sense and would be a useful addition.

// The data present in inlined_data represents only a partial
// sample of the data available for the ticket. No guarantees on the
// ordering are provided.
SAMPLE_DATA = 2;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about calling this PARTIAL_DATA? The term sample might lead some to think it would be a random sample, so just calling it partial would be a little more generic.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better name. Changed.

message InlinedData {
// Arrow data for consumption.
//
// The data is not expected to contain schema metadata. The schema
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess nit, but maybe 'not expected to contain the schema message'?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

//
// The data is not expected to contain the schema message. The schema
// should be identical to the schema provided on FlightInfo.
repeated FlightData data = 1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be reordered? Such that
inlined_completeness = 1
data = 2
That way if inlined_completeness == DATA_SUBSET_TYPE_UNDEFINED then there would be no need to read data.

This would then follow the similar pattern in FlightDescriptor where DescriptorType precedes the fields cmd and path whose contents or not is defined based on DescriptorType.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure.

@github-actions github-actions bot added the awaiting committer review Awaiting committer review label Apr 21, 2023
@github-actions github-actions bot added awaiting changes Awaiting changes awaiting change review Awaiting change review and removed awaiting committer review Awaiting committer review awaiting changes Awaiting changes labels Apr 22, 2023
@github-actions
Copy link

Thank you for your contribution. Unfortunately, this pull request has been marked as stale because it has had no activity in the past 365 days. Please remove the stale label or comment below, or this PR will be closed in 14 days. Feel free to re-open this if it has been closed in error. If you do not have repository permissions to reopen the PR, please tag a maintainer.

@github-actions github-actions bot added the Status: stale-warning Issues and PRs flagged as stale which are due to be closed if no indication otherwise label Nov 18, 2025
@github-actions github-actions bot closed this Dec 5, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

awaiting change review Awaiting change review Status: stale-warning Issues and PRs flagged as stale which are due to be closed if no indication otherwise

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants