Skip to content

Conversation

@lidavidm
Copy link
Member

Clarify that an empty list of locations means we should fetch data from the same server, and that multiple locations mean that all locations are equivalent.

See "[Java] [Flight] Usage of Locations on FlightEndpoint" on user@ for the original discussion.

@github-actions
Copy link

Copy link
Member

@BryanCutler BryanCutler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just had a minor question

* If the list is not empty, the expectation is that the ticket can
* be redeemed at any of the locations, and that the data returned
* will be equivalent. In other words, multiple locations provide
* redundancy/load balancing.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there an intention that if the endpoint list is not empty, the ticket must be redeemed from one of those and not the service that generated the ticket, or is that up to interpretation by the application?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, this is ambiguous still, but I think the former was intended. For instance from the original Flight blog post:

Nodes in a distributed cluster can take on different roles. For example, a subset of nodes might be responsible for planning queries while other nodes exclusively fulfill data stream (DoGet or DoPut) requests.

@jacques-n what do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the reason the original might be thought of as ambiguous is that people are failing to see that tickets were conceived of as single use. If tickets are single use, once I go to one location and use the ticket, I can't then use the ticket at a second location. A flight must be a consumption of all tickets to be complete. I wonder if instead of expressing here, we should clarify elsewhere that a ticket is single use and reuse of ticket results in undefined behavior. Thoughts?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that would be good to clarify, but that doesn't answer the question of, "if there are locations, can the ticket still be redeemed on the original server, or must it only be redeemed on one of the given locations"?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. The intention was, if there are locations, you must use one of the locations to redeem the ticket. One of the locations may be the current server.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the comments, I've updated this PR with both the Ticket and locations clarifications

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds reasonable to me

Copy link
Member

@BryanCutler BryanCutler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just had a thought on clarifying a little more about multiple endpoints, but otherwise looks good

* A list of endpoints associated with the flight. To consume the
* whole flight, all endpoints (and hence all Tickets) must be consumed.
*
* In other words, multiple endpoints provide partitioning.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry to come back to this, but I've had people come to me thinking that Flight will do the partitioning for them. So we don't want to make it seem like that is what's happening. Maybe reword slightly to something like "In other words, multiple endpoints can be used with partitioned data." or just "... multiple endpoints provide for partitioned data." ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a very good point. I've reworded things here, how does it look now?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused by your comment here a little bit. Flight can do partitioning. It is done by returning more than one ticket for a particular flight. (I agree that locations aren't about partitioning.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose it's because an application can implement partitioning via Flight, but Bryan has seen people thinking this means that using Flight will automatically partition/distribute data.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I meant was that Flight doesn't have built-in logic to actively partition a given data set. It's up to the service implementation to assign partitions to a given ticket.

* A list of endpoints associated with the flight. To consume the whole
* flight, all endpoints must be consumed.
* A list of endpoints associated with the flight. To consume the
* whole flight, all endpoints (and hence all Tickets) must be consumed.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worth clarifying that the list of endpoints can be consumed in any order?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

Though - that brings up the question of how to represent an ordered dataset. Is the expectation that (say) if results are sorted, then the endpoints should also be sorted?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My thought here is that the backend should enforce ordering by grouping ordered data together under one ticket. The client can't process this data in parallel anyway.

This would prevent exposing ordered fragments from separate locations though. It sort of makes sense for a query engine to have to collate data under a single node though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense. Though I think this is getting far out of scope of the original clarification now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @jduo on this comment. If data must be ordered, at least for now, it should be exposed via single ticket/endpoint.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pushed a short clarification.

Copy link
Contributor

@jacques-n jacques-n left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for the proto changes. I didn't review the code changes.

@lidavidm
Copy link
Member Author

@emkornfield, @pitrou: any thoughts on the Flight spec clarifications here?

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The clarification is very useful, thank you. Just two comments.

* A list of URIs where this ticket can be redeemed. If the list is
* empty, the expectation is that the ticket can only be redeemed on the
* current service where the ticket was generated.
* A list of URIs where this ticket can be redeemed.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps also explain how someone is supposed to "redeem" a ticket (note that verb doesn't seem to exist in the RPC APIs)?

* An opaque identifier that the service can use to retrieve a particular
* portion of a stream.
*
* Tickets are meant to be single use. It is an error/undefined behavior
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"undefined behavior" sounds a bit scary given the baggage this term has in C/C++. Perhaps say that the behavior can be implement-dependent and that implementations are allowed to raise an error?

@lidavidm
Copy link
Member Author

I filed ARROW-15987 for the CI failure (this happens to HEAD as well)

@jacques-n
Copy link
Contributor

Thanks for working through this @lidavidm!

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes to format/Flight.proto I think are a significant clarification. Thank you

* at one of the given locations, and not (necessarily) on the
* current service.
*
* In other words, an application can use multiple locations to
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this clarification is very helpful to me

@lidavidm lidavidm closed this in a17137f Mar 24, 2022
@lidavidm lidavidm deleted the arrow-15921 branch March 24, 2022 20:01
@ursabot
Copy link

ursabot commented Mar 24, 2022

Benchmark runs are scheduled for baseline = 012ae6e and contender = a17137f. a17137f is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Failed ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.71% ⬆️0.0%] test-mac-arm
[Failed ⬇️0.36% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.0% ⬆️0.0%] ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

pribor pushed a commit to GlobalWebIndex/arrow that referenced this pull request Oct 24, 2025
… FlightEndpoint.locations

Clarify that an empty list of locations means we should fetch data from the same server, and that multiple locations mean that all locations are equivalent.

See ["[Java] [Flight] Usage of Locations on FlightEndpoint"](https://lists.apache.org/thread/1668fk1myqf8168xh296qck5fn8ztcmn) on user@ for the original discussion.

Closes apache#12636 from lidavidm/arrow-15921

Authored-by: David Li <[email protected]>
Signed-off-by: David Li <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants