ARROW-9761: [C/C++] Add experimental C stream interface #8052
cc @wesm
cpp/src/arrow/c/abi.h
Outdated
Silly question: is "errno-compatible" a well-defined Unix/Windows term? Is there someplace where I can read more about it?
I don't know how to phrase it: it returns values that are errno error codes (in case of error). A number of values are standard in C++: https://en.cppreference.com/w/cpp/error/errno_macros
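To make "errno-compatible" concrete, here is a minimal sketch of a callback that returns 0 on success and a standard `errno` value on failure. `DemoSource` and `demo_read_chunk` are hypothetical names invented for this illustration; they are not part of `abi.h`.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical data source, for illustration only (not part of abi.h). */
struct DemoSource {
  int broken;      /* nonzero simulates an I/O failure */
  long next_rows;  /* number of rows the next chunk would contain */
};

/* Returns 0 on success, or a standard errno value on failure. */
static int demo_read_chunk(struct DemoSource* src, long* out_rows) {
  if (src == NULL || out_rows == NULL) return EINVAL; /* bad arguments */
  if (src->broken) return EIO;                        /* simulated I/O error */
  *out_rows = src->next_rows;
  return 0;
}
```

Because the codes are ordinary `errno` values, a caller can pass a nonzero result to `strerror()` to obtain a generic message.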
cpp/src/arrow/c/abi.h
Outdated
Nit: new line here, or remove the new line above?
cpp/src/arrow/c/abi.h
Outdated
Are arrays produced by this stream tied to the lifecycle of this stream (must they be released first)?
Hmm... I'd say no.
cpp/src/arrow/c/abi.h
Outdated
"Released" is something defined in the ArrowArray spec. Is it a stronger or weaker guarantee than returning a nullptr here?
A nullptr cannot be returned. The callback returns an int. However, we could say that returning -1 means end of stream.
I have one suggestion: The proposal indicates that the stream is finished when the array resulting from get_next is released. This seems a bit odd; how about just setting its length to 0? Or is it possible in the stream API that individual chunks are of length 0 but subsequent chunks are not?
It's possible to have some chunks with 0 length in a stream, yes. I don't see any reason to forbid it in this API (and such corner cases are often annoying to deal with).
However, as said above, perhaps returning
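The end-of-stream convention discussed above (a successful `get_next` that hands back an already-released array) can be sketched as a consumer loop. The structs below are simplified, hypothetical stand-ins defined only for this example, not the real `abi.h` definitions; the sketch assumes that "released" is represented by a NULL `release` callback.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified, hypothetical stand-ins (not the real abi.h structs). */
struct DemoArray {
  int64_t length;
  void (*release)(struct DemoArray*); /* NULL means "released" */
};

struct DemoStream {
  int (*get_next)(struct DemoStream*, struct DemoArray* out);
  const int64_t* chunk_lengths; /* lengths of the chunks to produce */
  size_t n_chunks;
  size_t position;
};

static void demo_array_release(struct DemoArray* array) {
  array->release = NULL; /* mark the array as released */
}

static int demo_get_next(struct DemoStream* stream, struct DemoArray* out) {
  if (stream->position >= stream->n_chunks) {
    out->length = 0;
    out->release = NULL; /* end of stream: output is already released */
    return 0;
  }
  out->length = stream->chunk_lengths[stream->position++];
  out->release = demo_array_release;
  return 0;
}

/* Drains the stream; a zero-length chunk does NOT end the loop. */
static int demo_consume(struct DemoStream* stream, int64_t* total_rows,
                        int* n_chunks) {
  *total_rows = 0;
  *n_chunks = 0;
  for (;;) {
    struct DemoArray chunk;
    int code = stream->get_next(stream, &chunk);
    if (code != 0) return code;          /* errno-style error */
    if (chunk.release == NULL) return 0; /* released array: end of stream */
    *total_rows += chunk.length;
    *n_chunks += 1;
    chunk.release(&chunk);
  }
}
```

Keeping the end-of-stream signal separate from the chunk length is what allows empty chunks in the middle of a stream without special-casing them.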
wesm
left a comment
This looks good to me. The next steps would be to integrate this in pyarrow and do the plumbing to get something working end-to-end with a third party project (e.g. DuckDB)
ec7c5aa to 1064c70
1064c70 to 04c25d8
I added some docs, please take a look.
04c25d8 to b71d965
Rebased. Does someone want to review the doc at https://pitrou.net/arrowdevdoc/format/CStreamInterface.html ?
I took a quick look and it seemed reasonable (nothing jumped out at me that needed changes).
b71d965 to 61e2c5d
Rebased again. We can probably merge soon, if CI agrees.
One thing we noticed: There is no way to
Would
I would think so, no?
Ok, I just wanted to make sure you weren't looking to go back a number of rows given as argument :-)
I think JDBC might in some cases allow for a "scrolling" type of rewind vs. restart (not saying this should support it, but it is worth mentioning).
Each environment (Python, C++, JDBC, etc.) has its own feature set for streams / iterators, which doesn't make our task easy :-/
Starting simple sounds good to me. At least for JDBC, I don't think it is required for driver implementations to support scrolling.
Rewinding doesn't strike me as something which needs to be part of the C stream protocol. APIs can still provide rewind and other semantics while using a simple-as-possible stream as a building block. In the case of a SQL view which can be viewed multiple times, for example, the SQL view need not be a stream. It could instead be a function returning streams (each beginning at the start of the view):

-ArrowArrayStream* MakeSqlView(...);
+SqlView* MakeSqlView(...);
+ArrowArrayStream* GetStream(SqlView*);

Rewind, scrolling, offsets, reverse iteration, etc. can all be accommodated in this fashion, so IMHO they don't belong in the protocol.
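A rough sketch of that layering, using hypothetical `DemoView` and `DemoStream` types rather than the real `SqlView` or `ArrowArrayStream`: the view owns the data, and every call to `demo_get_stream` returns a fresh forward-only cursor that starts at the beginning. Restarting is then just a matter of asking for another stream.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical types for illustration; not the real SqlView/ArrowArrayStream. */
struct DemoView {
  const int* rows;
  size_t n_rows;
};

struct DemoStream {
  const struct DemoView* view; /* borrowed; the view must outlive the stream */
  size_t position;
};

/* Each call yields an independent stream positioned at the first row. */
static struct DemoStream* demo_get_stream(const struct DemoView* view) {
  struct DemoStream* s = (struct DemoStream*)malloc(sizeof *s);
  if (s == NULL) return NULL;
  s->view = view;
  s->position = 0;
  return s;
}

/* Forward-only iteration: returns 1 and writes *out, or 0 at end of stream. */
static int demo_next(struct DemoStream* s, int* out) {
  if (s->position >= s->view->n_rows) return 0;
  *out = s->view->rows[s->position++];
  return 1;
}
```

The stream itself stays as simple as possible; richer semantics live in whatever object hands out streams.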
Another thing that occurred to me is whether we want to enable batch-level metadata (which would be implementation-defined). This is supported in Flight for example https://github.com/apache/arrow/blob/master/format/Flight.proto#L316 |
The goal is to have a standardized ABI to communicate streams of homogeneous arrays or record batches (for example for database result sets). The trickiest part is error reporting. This proposal tries to strike a compromise between simplicity (an integer error code mapping to errno values) and expressivity (an optional description string for application-specific and context-specific details).
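The compromise described here, an integer error code plus an optional description string, might look roughly like the sketch below. The struct is a simplified, hypothetical stand-in defined for this example only, and the error message text is invented; the sketch assumes the description callback may return NULL when no detail is available.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Simplified, hypothetical stand-in (not the real abi.h struct). */
struct DemoStream {
  int (*get_next)(struct DemoStream*);
  const char* (*get_last_error)(struct DemoStream*);
  const char* last_error; /* NULL when no description is available */
  int fail;               /* nonzero simulates a failing producer */
};

static int demo_get_next(struct DemoStream* s) {
  if (s->fail) {
    s->last_error = "connection to database lost"; /* invented detail */
    return EIO;                                     /* errno-style code */
  }
  s->last_error = NULL;
  return 0;
}

static const char* demo_get_last_error(struct DemoStream* s) {
  return s->last_error;
}

/* Prefer the producer's description; fall back to the generic errno text. */
static const char* demo_describe(struct DemoStream* s, int code) {
  const char* detail = s->get_last_error(s);
  return detail != NULL ? detail : strerror(code);
}
```

A consumer that only cares about success or failure can ignore the string entirely and branch on the integer.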
61e2c5d to 9ade9a9
9ade9a9 to 87e8f73
I don't think it makes sense to continue hesitating on the API. I think I'm going to merge the PR as-is. The interface is marked experimental, so it can still evolve (or even be removed).