ARROW-10249: [Rust] Support nested dictionaries inside list arrays #8430

vertexclique · 2020-10-10T20:31:33Z

Support nested dictionaries inside list arrays for arrow JSON reader.
This pr makes reading nested JSON fields a little bit easier by making a single reader method for the list arrays.

github-actions · 2020-10-10T20:34:22Z

https://issues.apache.org/jira/browse/ARROW-10249

vertexclique · 2020-10-12T11:55:35Z

@andygrove @paddyhoran @nevi-me Can I get a pair of eyes here?

jhorstmann · 2020-10-12T12:45:58Z

rust/arrow/src/json/reader.rs

I think splitting this method into two, one for plain Utf8 and one for dictionary encoded, would simplify the code and should allow it to work without dynamic dispatch. There is only a bit of common code converting the json into a vec (line 749) which could be extracted and reused.

I think splitting this method into two, one for plain Utf8 and one for dictionary encoded, would simplify the code and should allow it to work without dynamic dispatch.

There is no dynamic dispatch going on here. dyn ArrayBuilder is converted down to the concrete type. Dynamic dispatch is <T as ArrayBuilder>::finish(). That doesn't exist here.

There is only a bit of common code converting the json into a vec (line 749) which could be extracted and reused.

If extracted, loop contains different builders with different build methods as mentioned in:
https://github.com/apache/arrow/pull/8430/files#diff-3fa6fc3c0ee201f5a3a1a5d25d0062ffR775
So that wouldn't work, because you need code duplication with the same parameters of this method, which doesn't bring any value again.

Ok, but splitting the function would allow to remove the Box::leak and unsafe pointer casting, no?

Nope, it is inside the loop where the builder is carried over from the previous loop. If we apply recursion for nested structures sometime later, I want to keep the pointer juggling inside a single block.

nevi-me

I didn't do a detailed review, but I'm happy with the changes. It's been a while since I looked at the JSON reader, how hard/easy do you think it would be for us to support the outstanding work on https://issues.apache.org/jira/browse/ARROW-4534?

I also have the feeling that the reader might be slower than other readers. What has been your experience @vertexclique?

rust/arrow/src/json/reader.rs

vertexclique · 2020-10-12T20:56:33Z

@nevi-me

I didn't do a detailed review, but I'm happy with the changes. It's been a while since I looked at the JSON reader, how hard/easy do you think it would be for us to support the outstanding work on https://issues.apache.org/jira/browse/ARROW-4534?

Thanks! Especially on the nested reading part (https://issues.apache.org/jira/browse/ARROW-4544), it would be nice to reuse builders at entry. Having a recursive reader with a recursion_limit set would be good to go. If we go down into the iterative approach, we will explicitly generate a macro to expand on the compile-time with a depth embedded in. That might slow down to compile times and create larger binaries.

The good part of the recursive approach is that it will be limited by the stack size (but there might be growing stack implementation), where the user can increase this by hand. The bad part is that the recursion limit we have defined shouldn't hit to default stack size, and we should have a sweet spot for it.

About the other ticket that is still open (https://issues.apache.org/jira/browse/ARROW-4803). Type inference for schema might be hard at first. Although it is hard, we can do assumption based parsing by parsing that first(or +2) record's data to infer the type. But when the type is given, we can try to parse all down in iso format.

I also have the feeling that the reader might be slower than other readers. What has been your experience @vertexclique?

I didn't test the performance, since I was using this in tests, I needed it. We can create a benchmark for r/w, maybe? wdyt?

jorgecarleitao

I had a hard time understanding why we need to use unsafe, Box::leak and things like

                            &mut *(&mut builder as *mut &mut dyn ArrayBuilder
                                as *mut &mut ListBuilder<
                                    StringDictionaryBuilder<DICT_TY>,
                                >)
                        }

here.

IMO we may not be approaching this idiomatically, as we are fighting the borrow checker to have this compile. As a consequence, we are putting ourselves at risk with lifetime and dangling pointers.

I though that the unsafe code that we write was only required when we were close to the arrow format, as the datatype requires unsafe casts and pointer offsets. However, if we need to use unsafe when using builders, then IMO we have something wrong in our API.

I was unable to understand the core issue that require us to fight the checker like this. @vertexclique , could you try to explain / document the issue (in the code is fine)?

jorgecarleitao · 2020-10-14T03:52:28Z

rust/arrow/src/json/reader.rs

why do we need to break lifetime rules here with Box::leak? Isn't there a safe approach to this? If this is required, can we comment why this is required and why other approaches do not?

I share these concerns and agree that if there is a compelling argument for this we need to have it documented.

rust/arrow/src/json/reader.rs

vertexclique · 2020-10-18T17:34:16Z

@jorgecarleitao written docs.

vertexclique · 2020-10-19T12:54:06Z

@jorgecarleitao @nevi-me Can I push this out of the door peeps? I am kind of blocked by this atm.

jorgecarleitao · 2020-10-19T18:44:16Z

@vertexclique , I think that the feature is appreciated, but the implementation shows some underlying issue with the API to build arrays.

IMO, In light of this PR we should revisit the API before merging this to ensure that we can create nested structures without unsafe, and, if we need an unsafe, that we use it in a well-defined, limited scope.

If we accept this, IMO we are opening the door to all kind of usage of unsafe in high-level APIs, which is not only an anti-pattern in Rust, but also discouraged except in well defined use-cases, as it risks UB. Note that I am not criticizing this implementation in particular, but that if anyone changes this, it requires a significant care by everyone involved to ensure no UB.

In other words, this implementation adds technical debt that I (and I alone) will be unable to pay back, which means that I cannot commit at maintaining this change. I would obviously not block others from approving it and merging if they feel that they can maintain this change.

cc @paddyhoran

andygrove · 2020-10-20T14:02:42Z

@jorgecarleitao @nevi-me Can I push this out of the door peeps? I am kind of blocked by this atm.

@vertexclique When I started contributing here I had similar feelings but there is no reason you should be blocked. You can maintain a branch in your fork with all of your contributions that you need and have a git dependency on that. You could even vendor the code in your project if you need to release to crates.io

Many/most of us here are volunteers and reviewing PRs in our free time so we can't always be responsive when we have work commitments.

vertexclique · 2020-10-20T15:30:39Z

@andygrove

@vertexclique When I started contributing here I had similar feelings but there is no reason you should be blocked. You can maintain a branch in your fork with all of your contributions that you need and have a git dependency on that. You could even vendor the code in your project if you need to release to crates.io

Many/most of us here are volunteers and reviewing PRs in our free time so we can't always be responsive when we have work commitments.

Sorry if I caused a disturbance. I mean no disturbance here. We are in the same shoes altogether, since most of us are having work commitments. If I felt anyone uneasy, I don't mean that. Thanks for the friendly reminder.

…w json reader

This will slow down parsing on nested structures drastically because of RefCell overhead on every array block which is going to be appended to the builder.

jorgecarleitao

I went through this again, and thank you for changing to RefCell: that significantly helped me understand what the issue was: interior mutability. I was not even aware that what the code was doing was avoiding a RefCell.

IMO this is ready to merge.

(this part is regardless, and could fit in a future PR) You mentioned in the commit message that this drastically reduces performance. Did you quantified how much was that, or is there any benchmark that you have in mind that I can run on my own? I am asking that because that is one of the situations where unsafe can be considered based on the risks and maintenance vs performance benefits.

vertexclique · 2020-10-22T08:24:49Z

(this part is regardless, and could fit in a future PR) You mentioned in the commit message that this drastically reduces performance. Did you quantified how much was that, or is there any benchmark that you have in mind that I can run on my own? I am asking that because that is one of the situations where unsafe can be considered based on the risks and maintenance vs performance benefits.

Based on my estimations of how I wrote it, it will increase the parsing time for long record windows with large nested element size. I don't estimate how large will cause it but the main concern here is that refcell's stacked borrow and finalizer's dynamic dispatch. This can be decreased by pruning refcell by visiting the ticket that I mentioned in there to unify the interfaces. Then we will use dynamic dispatch in everywhere. Thou, dynamic dispatch can be slightly slower compared to pointer casts(or in-place transmutes) it can solve the overhead of refcell. This piece of code is open to revisiting after ARROW-10335.

github-actions bot added the Component: Rust label Oct 10, 2020

vertexclique changed the title ~~ARROW-10249 - Support nested dictionaries inside list arrays~~ ARROW-10249: [Rust] - Support nested dictionaries inside list arrays Oct 10, 2020

vertexclique changed the title ~~ARROW-10249: [Rust] - Support nested dictionaries inside list arrays~~ ARROW-10249: [Rust] Support nested dictionaries inside list arrays Oct 10, 2020

vertexclique force-pushed the ARROW-10249-support-nested-dictionaries-in-list-arrays branch from 7d03f81 to 30d2840 Compare October 10, 2020 20:33

jhorstmann reviewed Oct 12, 2020

View reviewed changes

nevi-me approved these changes Oct 12, 2020

View reviewed changes

rust/arrow/src/json/reader.rs Outdated Show resolved Hide resolved

jorgecarleitao reviewed Oct 14, 2020

View reviewed changes

kszucs force-pushed the ARROW-10249-support-nested-dictionaries-in-list-arrays branch from e47c1ba to 13bd401 Compare October 19, 2020 22:29

vertexclique added 5 commits October 21, 2020 17:29

ARROW-10249 - Support nested dictionaries inside list arrays for arro…

160c244

…w json reader

Use explicit pointer cast

60055af

Reword the comment

26f74cd

Write docs for methods

afc070d

Remove unsafe code

a0b488c

This will slow down parsing on nested structures drastically because of RefCell overhead on every array block which is going to be appended to the builder.

vertexclique force-pushed the ARROW-10249-support-nested-dictionaries-in-list-arrays branch from 280d706 to a0b488c Compare October 21, 2020 15:29

jorgecarleitao approved these changes Oct 22, 2020

View reviewed changes

jorgecarleitao closed this in 9cb0e9f Oct 23, 2020

asfimport mentioned this pull request Oct 23, 2020

[Rust]: Support Dictionary types for ListArrays in arrow json reader #26246

Closed

ARROW-10249: [Rust] Support nested dictionaries inside list arrays #8430

ARROW-10249: [Rust] Support nested dictionaries inside list arrays #8430

Uh oh!

Conversation

vertexclique commented Oct 10, 2020

Uh oh!

github-actions bot commented Oct 10, 2020

Uh oh!

vertexclique commented Oct 12, 2020

Uh oh!

jhorstmann Oct 12, 2020

Choose a reason for hiding this comment

Uh oh!

vertexclique Oct 12, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jhorstmann Oct 12, 2020

Choose a reason for hiding this comment

Uh oh!

vertexclique Oct 12, 2020

Choose a reason for hiding this comment

Uh oh!

nevi-me left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

vertexclique commented Oct 12, 2020

Uh oh!

jorgecarleitao left a comment

Choose a reason for hiding this comment

Uh oh!

jorgecarleitao Oct 14, 2020

Choose a reason for hiding this comment

Uh oh!

andygrove Oct 20, 2020

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

vertexclique commented Oct 18, 2020

Uh oh!

vertexclique commented Oct 19, 2020

Uh oh!

jorgecarleitao commented Oct 19, 2020

Uh oh!

andygrove commented Oct 20, 2020

Uh oh!

vertexclique commented Oct 20, 2020

Uh oh!

jorgecarleitao left a comment

Choose a reason for hiding this comment

Uh oh!

vertexclique commented Oct 22, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

vertexclique Oct 12, 2020 •

edited

Loading