Skip to content

Conversation

@pcmoritz
Copy link
Contributor

@pcmoritz pcmoritz commented Jan 18, 2019

The UnionBuilder API is still pretty rough, feedback on how it should look is appreciated. It also raises the question on what the best way to do type inference (i.e. which type will the UnionArray have?). Having this in the builder would be super useful though also for other applications where the schema is not known in advance.

@codecov-io
Copy link

codecov-io commented Jan 18, 2019

Codecov Report

Merging #3423 into master will increase coverage by 0.81%.
The diff coverage is 90.54%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #3423      +/-   ##
==========================================
+ Coverage   88.23%   89.04%   +0.81%     
==========================================
  Files         635      509     -126     
  Lines       80181    71122    -9059     
  Branches     1069        0    -1069     
==========================================
- Hits        70751    63334    -7417     
+ Misses       9317     7788    -1529     
+ Partials      113        0     -113
Impacted Files Coverage Δ
cpp/src/arrow/python/serialize.h 0% <ø> (ø) ⬆️
cpp/src/arrow/array/builder_union.cc 100% <100%> (ø)
cpp/src/arrow/array/builder_union.h 100% <100%> (ø)
cpp/src/arrow/array/builder_nested.cc 97.4% <100%> (+0.34%) ⬆️
cpp/src/arrow/python/serialize.cc 84.09% <81.91%> (-3.68%) ⬇️
cpp/src/arrow/python/deserialize.cc 90.43% <96.55%> (+3.76%) ⬆️
cpp/src/plasma/thirdparty/dlmalloc.c 38.29% <0%> (-7%) ⬇️
python/pyarrow/plasma.py 57.14% <0%> (-1.2%) ⬇️
cpp/src/plasma/thirdparty/ae/ae.c 71.09% <0%> (-0.95%) ⬇️
cpp/src/parquet/column_reader.cc 90.3% <0%> (-0.16%) ⬇️
... and 143 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update de84293...0c6ceda. Read the comment docs.

@wesm
Copy link
Member

wesm commented Jan 19, 2019

rebased. I had to squash because the prior merge commit made rebasing on master impossible

@wesm wesm force-pushed the union-builder branch 2 times, most recently from f678628 to 52cabd7 Compare January 20, 2019 16:50
Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a nice refactor. Please find a couple remarks below.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When does it happen? The same initialization already happens in ListBuilder::ListBuilder above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is for the case in which types are inferred at runtime.

Especially for nested builders and union builders, that's a very useful feature (potentially also for parsing CSV files and other operations where we don't know the types beforehand; sometimes it can be precomputed but espeically if there is a deeply nested data structure, that might be unnatural like for this PR). The way I have implemented it for this PR is by specifying a "nullptr" instead of a type, in which case the type will be computed upon Finish (and it will have the correct type including all the nested types). This is what this line is doing.

The way it is currently done in the PR shouldn't be merged, but we should have a good way to do this. There are two possibilities that come to my mind (they are all not super great, I hope you can come up with a better one):

  • Introduce and special type (Type::DYNAMIC or TYPE::BUILDING) which only be a type of a Builder, not an Array. If the Builder has this type, the real type will be inferred upon Finish(). This is compatible with the current API, but introduces this awkward type (and we have to be careful people don't use it for normal Arrays).

  • Change the builder API and remove the type() method. This might be too intrusive.

  • Do it similar to the way I'm doing it right now, and have a check in the type() method that the type is not nullptr.

Let me know about your thoughts @pitrou @wesm @xhochy

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another possibility I had considered is to have a boolean flag for the constructor of the builder that makes the type inference dynamic (in which case the type() method will check fail if it is called before the builder is finished).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realize you may have misunderstood my question above. To rephrase it: the type is already inferred in the ListBuilder constructor, why is it needed to infer it again here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The point of this is to not infer the type in the constructors, but in the Finish method. For more complex array builders like UnionBuilder, that's the first time we can reasonably know what the type is.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would preferably be done at constructor time, so that builder->type() returns something valid.

Also, I think you should add a StructBuilder constructor without a type argument, to reflect the fact that the type is now optional.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this needs some more discussion I think, see my above comment.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think it would be better to do this inference in the StructBuilder constructor, just like we do for ListBuilder. Is there a reason not to do it?

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates! I think this is mostly good to go. Just a couple small issues remaining.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DenseUnionBuilder

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK, the recommended way to pass the templated callables is to use "forwarding references" a.k.a. "universal references", in other words:

..., CreateSequenceFn&& create_sequence, SetItemFn&& set_item, ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to check for errors after this (especially, creating a large set or tuple might fail with a MemoryError).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other set_item lambdas steal a reference to item, but PySet_Add doesn't, so you need to Py_DECREF(item) by yourself here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, good catch! Should be fixed now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure the name MAX is a good idea, it might be a preprocessor-defined macro on some systems...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

renamed it, good catch

@pcmoritz
Copy link
Contributor Author

pcmoritz commented Jan 29, 2019

So I resolved the comments and also made the type inference for the builders more explicit. My preferred solution here in the longer term would be to make type inference the default (at least for non-primitive builders). What do you think?

@pitrou
Copy link
Member

pitrou commented Jan 29, 2019

My preferred solution here in the longer term would be to make type inference the default (at least for non-primitive builders).

In which context?

@pcmoritz
Copy link
Contributor Author

relating to this comment: #3423 (comment)

@pitrou
Copy link
Member

pitrou commented Jan 29, 2019

Btw, the Travis-CI failure is unrelated, you can rebase from master to get it fixed.

@pcmoritz
Copy link
Contributor Author

rebased

@pcmoritz
Copy link
Contributor Author

This is now ready to merge

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still two questions I don't remember getting an answer to :-)

}

// If the type has not been specified in the constructor, infer it
if (!arrow::internal::checked_cast<ListType&>(*type_).value_type()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll ask again: why are you doing this, since it is already done in the constructor? (see above)
Is there a situation where the ListBuilder constructor isn't called?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, this is for the case where type_.value_type() has been initialized to nullptr in the constructor (i.e. the type wasn't known when the ListBuilder was constructed so it was nullptr there), so it will be computed in the Finish method. This is used if the type is inferred automatically during Finish. This requires btw that .Finish of the value builder is called first (which it is). Does that make sense?

Copy link
Member

@pitrou pitrou Jan 31, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh.... Do you mean when value_builder_ itself is a type-inferred union builder? If so, can you add a comment? :-) Thanks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added the comment :)

RETURN_NOT_OK(children_[i]->FinishInternal(&child_data[i]));
}

// If the type has not been specified in the constructor, infer it
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, I wonder why this couldn't be made in the StructBuilder constructor, where the fields are already known.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to above, if the types associated to the StructBuilder were nullptr at construction time, they will be computed when .Finish is called.

@pcmoritz
Copy link
Contributor Author

Maybe I can make this point a little bit clearer:

This PR makes it so that for any of the complex builders (Struct, List, Union), you can specify "nullptr" as the type during construction of the builder, then just build the nested arrays as you go along, and at the end (in the Finish method), the builder will recursively compute the right types for the resulting array.

This is distinct to the current mode of operation, where all types need to be known upfront and can be very useful especially for constructing nested arrays where computing the types in advance is hard.

Again I'm not married to the particular API, in fact I think using nullptr as the type is a bit dangerous. I proposed some other ways to do it in #3423 (comment) and the followup post (and implemented one of them in 39982cd).

If this is too controversial to discuss in this PR, I can open a mailing list thread.

@pitrou pitrou closed this in aa43784 Jan 31, 2019
@pitrou
Copy link
Member

pitrou commented Jan 31, 2019

Thanks @pcmoritz !

@pcmoritz
Copy link
Contributor Author

Thanks @pitrou !

pcmoritz added a commit to pcmoritz/arrow that referenced this pull request Feb 4, 2019
The UnionBuilder API is still pretty rough, feedback on how it should look is appreciated. It also raises the question on what the best way to do type inference (i.e. which type will the UnionArray have?). Having this in the builder would be super useful though also for other applications where the schema is not known in advance.

Author: Philipp Moritz <[email protected]>

Closes apache#3423 from pcmoritz/union-builder and squashes the following commits:

0c6ceda <Philipp Moritz> add comment
9d5fcc2 <Philipp Moritz> lint
39982cd <Philipp Moritz> remove infer_type
745ed58 <Philipp Moritz> more comments
d5e2300 <Philipp Moritz> comments
9697e69 <Philipp Moritz> dynamic_cast -> checked_cast
0ae5b66 <Philipp Moritz> fixes
a82720c <Philipp Moritz> add documentation
91d08e3 <Philipp Moritz> remove macro
ff7e052 <Philipp Moritz> use VisitIterable
2b7e4b9 <Philipp Moritz> more cleanups
1db3c21 <Philipp Moritz> cleanups
7b41b05 <Philipp Moritz> starting to refactor UnionBuilder
7488834 <Philipp Moritz> use enum for types, type constructors
80c9838 <Philipp Moritz> clean up history for union builder branch
@robertnishihara robertnishihara deleted the union-builder branch February 4, 2019 20:52
robertnishihara pushed a commit that referenced this pull request Feb 5, 2019
…bjects

This is a regression from #3423, the recursion depth was incremented for arrays and dicts but not for lists, tuples and sets.

I also added a regression test for this.

Author: Philipp Moritz <[email protected]>

Closes #3556 from pcmoritz/recursive-serialization and squashes the following commits:

f83cff9 <Philipp Moritz> fix
903788e <Philipp Moritz> fix serialization of objects that contain themselves
xhochy pushed a commit that referenced this pull request Feb 8, 2019
…bjects

This is a regression from #3423, the recursion depth was incremented for arrays and dicts but not for lists, tuples and sets.

I also added a regression test for this.

Author: Philipp Moritz <[email protected]>

Closes #3556 from pcmoritz/recursive-serialization and squashes the following commits:

f83cff9 <Philipp Moritz> fix
903788e <Philipp Moritz> fix serialization of objects that contain themselves
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants