ARROW-10046: [Rust] [DataFusion] Made RecordBatchReader implement Iterator
#8225
FYI, I am trying to summarize how DataFusion iterates over data, and I came up with this summary.
This is a small conceptual change, but of fundamental importance: we can now build the iterator API on top of `next_batch`, as well as a `futures::Stream` trait.
alamb
left a comment
While this will be a breaking change for any code that currently uses `RecordBatchReader`, I think it is a significant improvement, and I vote (not that I am sure what my vote counts for :) ) that it be merged.
If the API breakage caused by removing `RecordBatchReader::next_batch` is a concern, I have an inline suggestion that could potentially mitigate it.
andygrove
left a comment
Very nice. Thanks @jorgecarleitao
This PR is built on top of #8225 and replaces `Arc<Mutex<dyn ...>>` with `Box<dyn ...>`.
In the TopK example, I had to move some functions out of the `impl`. This is because `self` cannot be borrowed as mutable and immutable at the same time, and, during iteration, it was being borrowed as mutable (to update the BTree) and as immutable (to access the `input`). There is probably a better way of achieving this, e.g. via interior mutability.
Closes #8307 from jorgecarleitao/box_iterator
Authored-by: Jorge C. Leitao <[email protected]>
Signed-off-by: Andy Grove <[email protected]>
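To illustrate the borrowing issue described in the commit message above, here is a minimal sketch using only the standard library. It is not the TopK example from the PR: `TopK`, `insert_bounded`, and the `i64` item type are hypothetical stand-ins, and the borrow conflict shown is the closely related mutable/mutable one. The idea is the same, though: a helper that takes `&mut self` cannot be called while `self.input` is being iterated, whereas a free function that borrows only the field it needs can.

```rust
use std::collections::BTreeSet;

/// Hypothetical operator that keeps the `k` largest distinct values of its input.
/// The input is owned as `Box<dyn Iterator>`, with no `Arc<Mutex<..>>` wrapper.
pub struct TopK {
    input: Box<dyn Iterator<Item = i64>>,
    k: usize,
    top: BTreeSet<i64>,
}

/// A free function rather than a `&mut self` method: it borrows only the set,
/// so it can be called while `self.input` is mutably borrowed by the loop.
fn insert_bounded(top: &mut BTreeSet<i64>, k: usize, value: i64) {
    top.insert(value);
    if top.len() > k {
        // Evict the smallest element so that at most `k` values remain.
        let smallest = *top.iter().next().expect("set is non-empty");
        top.remove(&smallest);
    }
}

impl TopK {
    pub fn new(input: Box<dyn Iterator<Item = i64>>, k: usize) -> Self {
        Self { input, k, top: BTreeSet::new() }
    }

    /// Consumes the input and returns the retained values, largest first.
    pub fn run(mut self) -> Vec<i64> {
        // `by_ref` borrows `self.input` mutably for the duration of the loop.
        // Calling a `&mut self` method here would not compile; borrowing the
        // disjoint fields `self.top` and `self.k` is fine.
        for value in self.input.by_ref() {
            insert_bounded(&mut self.top, self.k, value);
        }
        self.top.into_iter().rev().collect()
    }
}
```

A caller would construct it with something like `TopK::new(Box::new(values.into_iter()), 3).run()`. Interior mutability (e.g. `RefCell`) is the alternative hinted at in the message; the free function simply avoids the runtime borrow checks.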
This is a proposal to change how we programmatically iterate over record batches in Arrow and DataFusion.
Instead of looping over repeated calls to `next_batch`, iterate over the reader directly, i.e. via the `Iterator` trait.
This allows us to write more expressive code and to offer our users a well-documented, popular API (`Iterator`).
Finally, this change also opens the possibility of implementing `futures::Stream`, the async version of `Iterator`.
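The code snippets from the original description did not survive on this page. As a rough sketch of the two styles being contrasted, the following uses hypothetical stand-in types (`Batch`, `ReadError`, `NextBatchReader`, `VecReader`) rather than the real Arrow ones, and assumes the old method returned `Result<Option<Batch>>` while the iterator yields `Result<Batch>` items.

```rust
/// Hypothetical stand-in for a record batch; NOT the Arrow type.
#[derive(Debug, Clone, PartialEq)]
pub struct Batch {
    pub num_rows: usize,
}

/// Hypothetical stand-in for the error type.
#[derive(Debug, PartialEq)]
pub struct ReadError(pub String);

/// Old style: a bespoke method that the caller has to poll in a loop.
pub trait NextBatchReader {
    fn next_batch(&mut self) -> Result<Option<Batch>, ReadError>;
}

/// Toy reader that yields batches with preset row counts.
pub struct VecReader {
    remaining: std::vec::IntoIter<usize>,
}

impl VecReader {
    pub fn new(rows: Vec<usize>) -> Self {
        Self { remaining: rows.into_iter() }
    }
}

impl NextBatchReader for VecReader {
    fn next_batch(&mut self) -> Result<Option<Batch>, ReadError> {
        Ok(self.remaining.next().map(|num_rows| Batch { num_rows }))
    }
}

/// New style: the same reader exposed through the standard `Iterator` trait.
impl Iterator for VecReader {
    type Item = Result<Batch, ReadError>;

    fn next(&mut self) -> Option<Self::Item> {
        // Result<Option<T>, E> -> Option<Result<T, E>>
        self.next_batch().transpose()
    }
}

/// Consuming a reader through `next_batch`.
pub fn total_rows_old(reader: &mut dyn NextBatchReader) -> Result<usize, ReadError> {
    let mut total = 0;
    while let Some(batch) = reader.next_batch()? {
        total += batch.num_rows;
    }
    Ok(total)
}

/// Consuming a reader through `Iterator`: a plain `for` loop, and every
/// adapter (`map`, `filter`, `collect`, ...) becomes available for free.
pub fn total_rows_new(
    reader: impl Iterator<Item = Result<Batch, ReadError>>,
) -> Result<usize, ReadError> {
    let mut total = 0;
    for maybe_batch in reader {
        total += maybe_batch?.num_rows;
    }
    Ok(total)
}
```

With the iterator form, collecting row counts also becomes a one-liner, for example `reader.map(|b| b.map(|b| b.num_rows)).collect::<Result<Vec<_>, _>>()`, which stops at the first error.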