-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-10249: [Rust] Support nested dictionaries inside list arrays #8430
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-10249: [Rust] Support nested dictionaries inside list arrays #8430
Conversation
7d03f81 to
30d2840
Compare
|
@andygrove @paddyhoran @nevi-me Can I get a pair of eyes here? |
rust/arrow/src/json/reader.rs
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think splitting this method into two, one for plain Utf8 and one for dictionary encoded, would simplify the code and should allow it to work without dynamic dispatch. There is only a bit of common code converting the json into a vec (line 749) which could be extracted and reused.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think splitting this method into two, one for plain Utf8 and one for dictionary encoded, would simplify the code and should allow it to work without dynamic dispatch.
There is no dynamic dispatch going on here. dyn ArrayBuilder is converted down to the concrete type. Dynamic dispatch is <T as ArrayBuilder>::finish(). That doesn't exist here.
There is only a bit of common code converting the json into a vec (line 749) which could be extracted and reused.
If extracted, loop contains different builders with different build methods as mentioned in:
https://github.com/apache/arrow/pull/8430/files#diff-3fa6fc3c0ee201f5a3a1a5d25d0062ffR775
So that wouldn't work, because you need code duplication with the same parameters of this method, which doesn't bring any value again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, but splitting the function would allow to remove the Box::leak and unsafe pointer casting, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope, it is inside the loop where the builder is carried over from the previous loop. If we apply recursion for nested structures sometime later, I want to keep the pointer juggling inside a single block.
nevi-me
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't do a detailed review, but I'm happy with the changes. It's been a while since I looked at the JSON reader, how hard/easy do you think it would be for us to support the outstanding work on https://issues.apache.org/jira/browse/ARROW-4534?
I also have the feeling that the reader might be slower than other readers. What has been your experience @vertexclique?
Thanks! Especially on the nested reading part (https://issues.apache.org/jira/browse/ARROW-4544), it would be nice to reuse builders at entry. Having a recursive reader with a The good part of the recursive approach is that it will be limited by the stack size (but there might be growing stack implementation), where the user can increase this by hand. The bad part is that the recursion limit we have defined shouldn't hit to default stack size, and we should have a sweet spot for it. About the other ticket that is still open (https://issues.apache.org/jira/browse/ARROW-4803). Type inference for schema might be hard at first. Although it is hard, we can do assumption based parsing by parsing that first(or +2) record's data to infer the type. But when the type is given, we can try to parse all down in iso format.
I didn't test the performance, since I was using this in tests, I needed it. We can create a benchmark for r/w, maybe? wdyt? |
jorgecarleitao
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had a hard time understanding why we need to use unsafe, Box::leak and things like
&mut *(&mut builder as *mut &mut dyn ArrayBuilder
as *mut &mut ListBuilder<
StringDictionaryBuilder<DICT_TY>,
>)
}here.
IMO we may not be approaching this idiomatically, as we are fighting the borrow checker to have this compile. As a consequence, we are putting ourselves at risk with lifetime and dangling pointers.
I though that the unsafe code that we write was only required when we were close to the arrow format, as the datatype requires unsafe casts and pointer offsets. However, if we need to use unsafe when using builders, then IMO we have something wrong in our API.
I was unable to understand the core issue that require us to fight the checker like this. @vertexclique , could you try to explain / document the issue (in the code is fine)?
rust/arrow/src/json/reader.rs
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need to break lifetime rules here with Box::leak? Isn't there a safe approach to this? If this is required, can we comment why this is required and why other approaches do not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I share these concerns and agree that if there is a compelling argument for this we need to have it documented.
|
@jorgecarleitao written docs. |
|
@jorgecarleitao @nevi-me Can I push this out of the door peeps? I am kind of blocked by this atm. |
|
@vertexclique , I think that the feature is appreciated, but the implementation shows some underlying issue with the API to build arrays. IMO, In light of this PR we should revisit the API before merging this to ensure that we can create nested structures without If we accept this, IMO we are opening the door to all kind of usage of In other words, this implementation adds technical debt that I (and I alone) will be unable to pay back, which means that I cannot commit at maintaining this change. I would obviously not block others from approving it and merging if they feel that they can maintain this change. cc @paddyhoran |
e47c1ba to
13bd401
Compare
@vertexclique When I started contributing here I had similar feelings but there is no reason you should be blocked. You can maintain a branch in your fork with all of your contributions that you need and have a git dependency on that. You could even vendor the code in your project if you need to release to crates.io Many/most of us here are volunteers and reviewing PRs in our free time so we can't always be responsive when we have work commitments. |
Sorry if I caused a disturbance. I mean no disturbance here. We are in the same shoes altogether, since most of us are having work commitments. If I felt anyone uneasy, I don't mean that. Thanks for the friendly reminder. |
This will slow down parsing on nested structures drastically because of RefCell overhead on every array block which is going to be appended to the builder.
280d706 to
a0b488c
Compare
jorgecarleitao
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went through this again, and thank you for changing to RefCell: that significantly helped me understand what the issue was: interior mutability. I was not even aware that what the code was doing was avoiding a RefCell.
IMO this is ready to merge.
(this part is regardless, and could fit in a future PR) You mentioned in the commit message that this drastically reduces performance. Did you quantified how much was that, or is there any benchmark that you have in mind that I can run on my own? I am asking that because that is one of the situations where unsafe can be considered based on the risks and maintenance vs performance benefits.
Based on my estimations of how I wrote it, it will increase the parsing time for long record windows with large nested element size. I don't estimate how large will cause it but the main concern here is that refcell's stacked borrow and finalizer's dynamic dispatch. This can be decreased by pruning refcell by visiting the ticket that I mentioned in there to unify the interfaces. Then we will use dynamic dispatch in everywhere. Thou, dynamic dispatch can be slightly slower compared to pointer casts(or in-place transmutes) it can solve the overhead of refcell. This piece of code is open to revisiting after ARROW-10335. |
Support nested dictionaries inside list arrays for arrow JSON reader.
This pr makes reading nested JSON fields a little bit easier by making a single reader method for the list arrays.