Skip to content

Conversation

@vertexclique
Copy link
Contributor

Support nested dictionaries inside list arrays for arrow JSON reader.
This pr makes reading nested JSON fields a little bit easier by making a single reader method for the list arrays.

@vertexclique vertexclique changed the title ARROW-10249 - Support nested dictionaries inside list arrays ARROW-10249: [Rust] - Support nested dictionaries inside list arrays Oct 10, 2020
@vertexclique vertexclique changed the title ARROW-10249: [Rust] - Support nested dictionaries inside list arrays ARROW-10249: [Rust] Support nested dictionaries inside list arrays Oct 10, 2020
@vertexclique vertexclique force-pushed the ARROW-10249-support-nested-dictionaries-in-list-arrays branch from 7d03f81 to 30d2840 Compare October 10, 2020 20:33
@github-actions
Copy link

@vertexclique
Copy link
Contributor Author

@andygrove @paddyhoran @nevi-me Can I get a pair of eyes here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think splitting this method into two, one for plain Utf8 and one for dictionary encoded, would simplify the code and should allow it to work without dynamic dispatch. There is only a bit of common code converting the json into a vec (line 749) which could be extracted and reused.

Copy link
Contributor Author

@vertexclique vertexclique Oct 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think splitting this method into two, one for plain Utf8 and one for dictionary encoded, would simplify the code and should allow it to work without dynamic dispatch.

There is no dynamic dispatch going on here. dyn ArrayBuilder is converted down to the concrete type. Dynamic dispatch is <T as ArrayBuilder>::finish(). That doesn't exist here.

There is only a bit of common code converting the json into a vec (line 749) which could be extracted and reused.

If extracted, loop contains different builders with different build methods as mentioned in:
https://github.com/apache/arrow/pull/8430/files#diff-3fa6fc3c0ee201f5a3a1a5d25d0062ffR775
So that wouldn't work, because you need code duplication with the same parameters of this method, which doesn't bring any value again.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, but splitting the function would allow to remove the Box::leak and unsafe pointer casting, no?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, it is inside the loop where the builder is carried over from the previous loop. If we apply recursion for nested structures sometime later, I want to keep the pointer juggling inside a single block.

Copy link
Contributor

@nevi-me nevi-me left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't do a detailed review, but I'm happy with the changes. It's been a while since I looked at the JSON reader, how hard/easy do you think it would be for us to support the outstanding work on https://issues.apache.org/jira/browse/ARROW-4534?

I also have the feeling that the reader might be slower than other readers. What has been your experience @vertexclique?

@vertexclique
Copy link
Contributor Author

@nevi-me

I didn't do a detailed review, but I'm happy with the changes. It's been a while since I looked at the JSON reader, how hard/easy do you think it would be for us to support the outstanding work on https://issues.apache.org/jira/browse/ARROW-4534?

Thanks! Especially on the nested reading part (https://issues.apache.org/jira/browse/ARROW-4544), it would be nice to reuse builders at entry. Having a recursive reader with a recursion_limit set would be good to go. If we go down into the iterative approach, we will explicitly generate a macro to expand on the compile-time with a depth embedded in. That might slow down to compile times and create larger binaries.

The good part of the recursive approach is that it will be limited by the stack size (but there might be growing stack implementation), where the user can increase this by hand. The bad part is that the recursion limit we have defined shouldn't hit to default stack size, and we should have a sweet spot for it.

About the other ticket that is still open (https://issues.apache.org/jira/browse/ARROW-4803). Type inference for schema might be hard at first. Although it is hard, we can do assumption based parsing by parsing that first(or +2) record's data to infer the type. But when the type is given, we can try to parse all down in iso format.

I also have the feeling that the reader might be slower than other readers. What has been your experience @vertexclique?

I didn't test the performance, since I was using this in tests, I needed it. We can create a benchmark for r/w, maybe? wdyt?

Copy link
Member

@jorgecarleitao jorgecarleitao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a hard time understanding why we need to use unsafe, Box::leak and things like

                            &mut *(&mut builder as *mut &mut dyn ArrayBuilder
                                as *mut &mut ListBuilder<
                                    StringDictionaryBuilder<DICT_TY>,
                                >)
                        }

here.

IMO we may not be approaching this idiomatically, as we are fighting the borrow checker to have this compile. As a consequence, we are putting ourselves at risk with lifetime and dangling pointers.

I though that the unsafe code that we write was only required when we were close to the arrow format, as the datatype requires unsafe casts and pointer offsets. However, if we need to use unsafe when using builders, then IMO we have something wrong in our API.

I was unable to understand the core issue that require us to fight the checker like this. @vertexclique , could you try to explain / document the issue (in the code is fine)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need to break lifetime rules here with Box::leak? Isn't there a safe approach to this? If this is required, can we comment why this is required and why other approaches do not?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I share these concerns and agree that if there is a compelling argument for this we need to have it documented.

@vertexclique
Copy link
Contributor Author

@jorgecarleitao written docs.

@vertexclique
Copy link
Contributor Author

@jorgecarleitao @nevi-me Can I push this out of the door peeps? I am kind of blocked by this atm.

@jorgecarleitao
Copy link
Member

@vertexclique , I think that the feature is appreciated, but the implementation shows some underlying issue with the API to build arrays.

IMO, In light of this PR we should revisit the API before merging this to ensure that we can create nested structures without unsafe, and, if we need an unsafe, that we use it in a well-defined, limited scope.

If we accept this, IMO we are opening the door to all kind of usage of unsafe in high-level APIs, which is not only an anti-pattern in Rust, but also discouraged except in well defined use-cases, as it risks UB. Note that I am not criticizing this implementation in particular, but that if anyone changes this, it requires a significant care by everyone involved to ensure no UB.

In other words, this implementation adds technical debt that I (and I alone) will be unable to pay back, which means that I cannot commit at maintaining this change. I would obviously not block others from approving it and merging if they feel that they can maintain this change.

cc @paddyhoran

@kszucs kszucs force-pushed the ARROW-10249-support-nested-dictionaries-in-list-arrays branch from e47c1ba to 13bd401 Compare October 19, 2020 22:29
@andygrove
Copy link
Member

@jorgecarleitao @nevi-me Can I push this out of the door peeps? I am kind of blocked by this atm.

@vertexclique When I started contributing here I had similar feelings but there is no reason you should be blocked. You can maintain a branch in your fork with all of your contributions that you need and have a git dependency on that. You could even vendor the code in your project if you need to release to crates.io

Many/most of us here are volunteers and reviewing PRs in our free time so we can't always be responsive when we have work commitments.

@vertexclique
Copy link
Contributor Author

@andygrove

@vertexclique When I started contributing here I had similar feelings but there is no reason you should be blocked. You can maintain a branch in your fork with all of your contributions that you need and have a git dependency on that. You could even vendor the code in your project if you need to release to crates.io

Many/most of us here are volunteers and reviewing PRs in our free time so we can't always be responsive when we have work commitments.

Sorry if I caused a disturbance. I mean no disturbance here. We are in the same shoes altogether, since most of us are having work commitments. If I felt anyone uneasy, I don't mean that. Thanks for the friendly reminder.

This will slow down parsing on nested structures drastically because of RefCell overhead on every array block
which is going to be appended to the builder.
@vertexclique vertexclique force-pushed the ARROW-10249-support-nested-dictionaries-in-list-arrays branch from 280d706 to a0b488c Compare October 21, 2020 15:29
Copy link
Member

@jorgecarleitao jorgecarleitao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went through this again, and thank you for changing to RefCell: that significantly helped me understand what the issue was: interior mutability. I was not even aware that what the code was doing was avoiding a RefCell.

IMO this is ready to merge.

(this part is regardless, and could fit in a future PR) You mentioned in the commit message that this drastically reduces performance. Did you quantified how much was that, or is there any benchmark that you have in mind that I can run on my own? I am asking that because that is one of the situations where unsafe can be considered based on the risks and maintenance vs performance benefits.

@vertexclique
Copy link
Contributor Author

(this part is regardless, and could fit in a future PR) You mentioned in the commit message that this drastically reduces performance. Did you quantified how much was that, or is there any benchmark that you have in mind that I can run on my own? I am asking that because that is one of the situations where unsafe can be considered based on the risks and maintenance vs performance benefits.

Based on my estimations of how I wrote it, it will increase the parsing time for long record windows with large nested element size. I don't estimate how large will cause it but the main concern here is that refcell's stacked borrow and finalizer's dynamic dispatch. This can be decreased by pruning refcell by visiting the ticket that I mentioned in there to unify the interfaces. Then we will use dynamic dispatch in everywhere. Thou, dynamic dispatch can be slightly slower compared to pointer casts(or in-place transmutes) it can solve the overhead of refcell. This piece of code is open to revisiting after ARROW-10335.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants