-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-15921: [Format][FlightRPC][C++][Java] Clarify interpretation of FlightEndpoint.locations #12636
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
BryanCutler
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, just had a minor question
format/Flight.proto
Outdated
| * If the list is not empty, the expectation is that the ticket can | ||
| * be redeemed at any of the locations, and that the data returned | ||
| * will be equivalent. In other words, multiple locations provide | ||
| * redundancy/load balancing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there an intention that if the endpoint list is not empty, the ticket must be redeemed from one of those and not the service that generated the ticket, or is that up to interpretation by the application?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, this is ambiguous still, but I think the former was intended. For instance from the original Flight blog post:
Nodes in a distributed cluster can take on different roles. For example, a subset of nodes might be responsible for planning queries while other nodes exclusively fulfill data stream (DoGet or DoPut) requests.
@jacques-n what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the reason the original might be thought of as ambiguous is that people are failing to see that tickets were conceived of as single use. If tickets are single use, once I go to one location and use the ticket, I can't then use the ticket at a second location. A flight must be a consumption of all tickets to be complete. I wonder if instead of expressing here, we should clarify elsewhere that a ticket is single use and reuse of ticket results in undefined behavior. Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that would be good to clarify, but that doesn't answer the question of, "if there are locations, can the ticket still be redeemed on the original server, or must it only be redeemed on one of the given locations"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah. The intention was, if there are locations, you must use one of the locations to redeem the ticket. One of the locations may be the current server.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the comments, I've updated this PR with both the Ticket and locations clarifications
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds reasonable to me
...ion-tests/src/main/java/org/apache/arrow/flight/integration/tests/IntegrationTestClient.java
Outdated
Show resolved
Hide resolved
...ion-tests/src/main/java/org/apache/arrow/flight/integration/tests/IntegrationTestClient.java
Outdated
Show resolved
Hide resolved
BryanCutler
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just had a thought on clarifying a little more about multiple endpoints, but otherwise looks good
format/Flight.proto
Outdated
| * A list of endpoints associated with the flight. To consume the | ||
| * whole flight, all endpoints (and hence all Tickets) must be consumed. | ||
| * | ||
| * In other words, multiple endpoints provide partitioning. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry to come back to this, but I've had people come to me thinking that Flight will do the partitioning for them. So we don't want to make it seem like that is what's happening. Maybe reword slightly to something like "In other words, multiple endpoints can be used with partitioned data." or just "... multiple endpoints provide for partitioned data." ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a very good point. I've reworded things here, how does it look now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm confused by your comment here a little bit. Flight can do partitioning. It is done by returning more than one ticket for a particular flight. (I agree that locations aren't about partitioning.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose it's because an application can implement partitioning via Flight, but Bryan has seen people thinking this means that using Flight will automatically partition/distribute data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I meant was that Flight doesn't have built-in logic to actively partition a given data set. It's up to the service implementation to assign partitions to a given ticket.
format/Flight.proto
Outdated
| * A list of endpoints associated with the flight. To consume the whole | ||
| * flight, all endpoints must be consumed. | ||
| * A list of endpoints associated with the flight. To consume the | ||
| * whole flight, all endpoints (and hence all Tickets) must be consumed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it worth clarifying that the list of endpoints can be consumed in any order?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
Though - that brings up the question of how to represent an ordered dataset. Is the expectation that (say) if results are sorted, then the endpoints should also be sorted?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My thought here is that the backend should enforce ordering by grouping ordered data together under one ticket. The client can't process this data in parallel anyway.
This would prevent exposing ordered fragments from separate locations though. It sort of makes sense for a query engine to have to collate data under a single node though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense. Though I think this is getting far out of scope of the original clarification now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @jduo on this comment. If data must be ordered, at least for now, it should be exposed via single ticket/endpoint.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pushed a short clarification.
jacques-n
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for the proto changes. I didn't review the code changes.
|
@emkornfield, @pitrou: any thoughts on the Flight spec clarifications here? |
pitrou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The clarification is very useful, thank you. Just two comments.
format/Flight.proto
Outdated
| * A list of URIs where this ticket can be redeemed. If the list is | ||
| * empty, the expectation is that the ticket can only be redeemed on the | ||
| * current service where the ticket was generated. | ||
| * A list of URIs where this ticket can be redeemed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps also explain how someone is supposed to "redeem" a ticket (note that verb doesn't seem to exist in the RPC APIs)?
format/Flight.proto
Outdated
| * An opaque identifier that the service can use to retrieve a particular | ||
| * portion of a stream. | ||
| * | ||
| * Tickets are meant to be single use. It is an error/undefined behavior |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"undefined behavior" sounds a bit scary given the baggage this term has in C/C++. Perhaps say that the behavior can be implement-dependent and that implementations are allowed to raise an error?
|
I filed ARROW-15987 for the CI failure (this happens to HEAD as well) |
|
Thanks for working through this @lidavidm! |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The changes to format/Flight.proto I think are a significant clarification. Thank you
| * at one of the given locations, and not (necessarily) on the | ||
| * current service. | ||
| * | ||
| * In other words, an application can use multiple locations to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this clarification is very helpful to me
|
Benchmark runs are scheduled for baseline = 012ae6e and contender = a17137f. a17137f is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
… FlightEndpoint.locations Clarify that an empty list of locations means we should fetch data from the same server, and that multiple locations mean that all locations are equivalent. See ["[Java] [Flight] Usage of Locations on FlightEndpoint"](https://lists.apache.org/thread/1668fk1myqf8168xh296qck5fn8ztcmn) on user@ for the original discussion. Closes apache#12636 from lidavidm/arrow-15921 Authored-by: David Li <[email protected]> Signed-off-by: David Li <[email protected]>
Clarify that an empty list of locations means we should fetch data from the same server, and that multiple locations mean that all locations are equivalent.
See "[Java] [Flight] Usage of Locations on FlightEndpoint" on user@ for the original discussion.