-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-4285: [Python] Use proper builder interface for serialization #3423
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## master #3423 +/- ##
==========================================
+ Coverage 88.23% 89.04% +0.81%
==========================================
Files 635 509 -126
Lines 80181 71122 -9059
Branches 1069 0 -1069
==========================================
- Hits 70751 63334 -7417
+ Misses 9317 7788 -1529
+ Partials 113 0 -113
Continue to review full report at Codecov.
|
|
rebased. I had to squash because the prior merge commit made rebasing on master impossible |
f678628 to
52cabd7
Compare
pitrou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like a nice refactor. Please find a couple remarks below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When does it happen? The same initialization already happens in ListBuilder::ListBuilder above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is for the case in which types are inferred at runtime.
Especially for nested builders and union builders, that's a very useful feature (potentially also for parsing CSV files and other operations where we don't know the types beforehand; sometimes it can be precomputed but espeically if there is a deeply nested data structure, that might be unnatural like for this PR). The way I have implemented it for this PR is by specifying a "nullptr" instead of a type, in which case the type will be computed upon Finish (and it will have the correct type including all the nested types). This is what this line is doing.
The way it is currently done in the PR shouldn't be merged, but we should have a good way to do this. There are two possibilities that come to my mind (they are all not super great, I hope you can come up with a better one):
-
Introduce and special type (Type::DYNAMIC or TYPE::BUILDING) which only be a type of a Builder, not an Array. If the Builder has this type, the real type will be inferred upon Finish(). This is compatible with the current API, but introduces this awkward type (and we have to be careful people don't use it for normal Arrays).
-
Change the builder API and remove the type() method. This might be too intrusive.
-
Do it similar to the way I'm doing it right now, and have a check in the type() method that the type is not nullptr.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another possibility I had considered is to have a boolean flag for the constructor of the builder that makes the type inference dynamic (in which case the type() method will check fail if it is called before the builder is finished).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I realize you may have misunderstood my question above. To rephrase it: the type is already inferred in the ListBuilder constructor, why is it needed to infer it again here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The point of this is to not infer the type in the constructors, but in the Finish method. For more complex array builders like UnionBuilder, that's the first time we can reasonably know what the type is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would preferably be done at constructor time, so that builder->type() returns something valid.
Also, I think you should add a StructBuilder constructor without a type argument, to reflect the fact that the type is now optional.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this needs some more discussion I think, see my above comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still think it would be better to do this inference in the StructBuilder constructor, just like we do for ListBuilder. Is there a reason not to do it?
0f0fcaa to
8a4bdca
Compare
pitrou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the updates! I think this is mostly good to go. Just a couple small issues remaining.
cpp/src/arrow/array/builder_union.h
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DenseUnionBuilder
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
cpp/src/arrow/python/deserialize.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAIK, the recommended way to pass the templated callables is to use "forwarding references" a.k.a. "universal references", in other words:
..., CreateSequenceFn&& create_sequence, SetItemFn&& set_item, ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
cpp/src/arrow/python/deserialize.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to check for errors after this (especially, creating a large set or tuple might fail with a MemoryError).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
cpp/src/arrow/python/deserialize.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Other set_item lambdas steal a reference to item, but PySet_Add doesn't, so you need to Py_DECREF(item) by yourself here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, good catch! Should be fixed now.
cpp/src/arrow/python/serialize.h
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure the name MAX is a good idea, it might be a preprocessor-defined macro on some systems...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
renamed it, good catch
|
So I resolved the comments and also made the type inference for the builders more explicit. My preferred solution here in the longer term would be to make type inference the default (at least for non-primitive builders). What do you think? |
In which context? |
|
relating to this comment: #3423 (comment) |
7169f85 to
d1493bc
Compare
|
Btw, the Travis-CI failure is unrelated, you can rebase from master to get it fixed. |
f81c95a to
9d5fcc2
Compare
|
rebased |
|
This is now ready to merge |
pitrou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still two questions I don't remember getting an answer to :-)
| } | ||
|
|
||
| // If the type has not been specified in the constructor, infer it | ||
| if (!arrow::internal::checked_cast<ListType&>(*type_).value_type()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll ask again: why are you doing this, since it is already done in the constructor? (see above)
Is there a situation where the ListBuilder constructor isn't called?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, this is for the case where type_.value_type() has been initialized to nullptr in the constructor (i.e. the type wasn't known when the ListBuilder was constructed so it was nullptr there), so it will be computed in the Finish method. This is used if the type is inferred automatically during Finish. This requires btw that .Finish of the value builder is called first (which it is). Does that make sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh.... Do you mean when value_builder_ itself is a type-inferred union builder? If so, can you add a comment? :-) Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added the comment :)
| RETURN_NOT_OK(children_[i]->FinishInternal(&child_data[i])); | ||
| } | ||
|
|
||
| // If the type has not been specified in the constructor, infer it |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, I wonder why this couldn't be made in the StructBuilder constructor, where the fields are already known.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to above, if the types associated to the StructBuilder were nullptr at construction time, they will be computed when .Finish is called.
|
Maybe I can make this point a little bit clearer: This PR makes it so that for any of the complex builders (Struct, List, Union), you can specify "nullptr" as the type during construction of the builder, then just build the nested arrays as you go along, and at the end (in the Finish method), the builder will recursively compute the right types for the resulting array. This is distinct to the current mode of operation, where all types need to be known upfront and can be very useful especially for constructing nested arrays where computing the types in advance is hard. Again I'm not married to the particular API, in fact I think using nullptr as the type is a bit dangerous. I proposed some other ways to do it in #3423 (comment) and the followup post (and implemented one of them in 39982cd). If this is too controversial to discuss in this PR, I can open a mailing list thread. |
|
Thanks @pcmoritz ! |
|
Thanks @pitrou ! |
The UnionBuilder API is still pretty rough, feedback on how it should look is appreciated. It also raises the question on what the best way to do type inference (i.e. which type will the UnionArray have?). Having this in the builder would be super useful though also for other applications where the schema is not known in advance. Author: Philipp Moritz <[email protected]> Closes apache#3423 from pcmoritz/union-builder and squashes the following commits: 0c6ceda <Philipp Moritz> add comment 9d5fcc2 <Philipp Moritz> lint 39982cd <Philipp Moritz> remove infer_type 745ed58 <Philipp Moritz> more comments d5e2300 <Philipp Moritz> comments 9697e69 <Philipp Moritz> dynamic_cast -> checked_cast 0ae5b66 <Philipp Moritz> fixes a82720c <Philipp Moritz> add documentation 91d08e3 <Philipp Moritz> remove macro ff7e052 <Philipp Moritz> use VisitIterable 2b7e4b9 <Philipp Moritz> more cleanups 1db3c21 <Philipp Moritz> cleanups 7b41b05 <Philipp Moritz> starting to refactor UnionBuilder 7488834 <Philipp Moritz> use enum for types, type constructors 80c9838 <Philipp Moritz> clean up history for union builder branch
…bjects This is a regression from #3423, the recursion depth was incremented for arrays and dicts but not for lists, tuples and sets. I also added a regression test for this. Author: Philipp Moritz <[email protected]> Closes #3556 from pcmoritz/recursive-serialization and squashes the following commits: f83cff9 <Philipp Moritz> fix 903788e <Philipp Moritz> fix serialization of objects that contain themselves
…bjects This is a regression from #3423, the recursion depth was incremented for arrays and dicts but not for lists, tuples and sets. I also added a regression test for this. Author: Philipp Moritz <[email protected]> Closes #3556 from pcmoritz/recursive-serialization and squashes the following commits: f83cff9 <Philipp Moritz> fix 903788e <Philipp Moritz> fix serialization of objects that contain themselves
The UnionBuilder API is still pretty rough, feedback on how it should look is appreciated. It also raises the question on what the best way to do type inference (i.e. which type will the UnionArray have?). Having this in the builder would be super useful though also for other applications where the schema is not known in advance.