-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-8426: [Rust] [Parquet] Add support for writing dictionary types #8291
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-8426: [Rust] [Parquet] Add support for writing dictionary types #8291
Conversation
In this commit, I: - Extracted a `build_field` function for some code shared between `schema_to_fb` and `schema_to_fb_offset` that needed to change - Uncommented the dictionary field from the Arrow schema roundtrip test and add a dictionary field to the IPC roundtrip test - If a field is a dictionary field, call `add_dictionary` with the dictionary field information on the flatbuffer field, building the dictionary as [the C++ code does][cpp-dictionary] and describe with the same comment - When getting the field type for a dictionary field, use the `value_type` as [the C++ code does][cpp-value-type] and describe with the same comment The tests pass because the Parquet -> Arrow conversion for dictionaries is [already supported][parquet-to-arrow]. [cpp-dictionary]: https://github.com/apache/arrow/blob/477c1021ac013f22389baf9154fb9ad0cf814bec/cpp/src/arrow/ipc/metadata_internal.cc#L426-L440 [cpp-value-type]: https://github.com/apache/arrow/blob/477c1021ac013f22389baf9154fb9ad0cf814bec/cpp/src/arrow/ipc/metadata_internal.cc#L662-L667 [parquet-to-arrow]: https://github.com/apache/arrow/blob/477c1021ac013f22389baf9154fb9ad0cf814bec/rust/arrow/src/ipc/convert.rs#L120-L127
|
Aaaand I see now that the ticket includes writing the dictionary data, while this PR only writes the dictionary schema. Well, this is part of the work needed anyway... getting to the rest of it now 🤦♀️ |
nevi-me
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @carols10cents, I have a comment re dictionary signed support.
From looking at the changes, they're related to Arrow's IPC file/stream format, and Parquet; which are independent. It's not an issue as we're getting to progress on both on this PR :)
We might not be merging the changes on this branch into the main branch in time for 2.0.0 (est October 9), so if a feature that we work on ends up touching IPC code in immediately beneficial ways (we still have a lot to cover), we could merge it against the main branch, then rebase this parquet one into it.
| // validated elsewhere, and can safely assume we are dealing with signed | ||
| // integers | ||
| let mut index_builder = ipc::IntBuilder::new(fbb); | ||
| index_builder.add_is_signed(true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although signed indices are preferred, we do support unsigned dictionary indices in the data types, so we might need to populate this based on the index_type
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@carols10cents we can address this in future, I'm merging this
|
Oops, I... thought I knew what I was doing a little bit, and I clearly don't at all 😅 I think I went down this path because rust/parquet/src/arrow/schema.rs had the commented-out dictionary field that was added to this branch in this test: and I thought "the writer" in this comment and exercised by this test meant the Arrow -> Parquet writer. But it's the changes in rust/arrow/src/ipc/convert.rs that get this test to pass, so I'm not sure where I got turned around... Am I at least correct that adding dictionary support will involve adding support for the dictionary schema as well as the dictionary data? And that neither of those are done yet? I'm also unclear about what I should do with this PR based on your comment; should I close it for now or fix the signed index type issue and reopen it against master? |
Ok well I just found https://github.com/apache/arrow/compare/rust-parquet-arrow-writer#diff-b1d078f4cb4841c220e8a274543ef616R356-R360 which seems to indicate no, there isn't actually schema conversion that needs to be done :( |
|
@nevi-me Regarding
How can I best help with getting the rust-parquet-arrow-writer branch ready to be merged into master? I see you're concerned about test coverage, and I think writing more tests would help me to understand all the things I clearly don't 😅 What parts need test coverage the most, in your opinion? |
|
Hi @carols10cents, my apologies for the delay.
The Parquet format's dictionary semantics are in my understanding different to Arrow, in that we could have a My comment was more in passing, and generally around dictionary support in IPC, in that beyond serializing an |
The main blocker right now is computing nested definition and repetition levels for deeply nested arrays (see https://github.com/apache/arrow/tree/master/cpp/src/parquet/arrow for the CPP implementation). I've been working on that on and off, due to time constraints (and the need to spend a good chunk of time per session; which I haven't had much of). Perhaps I can commit my work so you can have a look at that. I'll also open more JIRAs for work I believe we'd need to cover. Writing more test cases will also def help us. |
|
Merged |
In this commit, I: - Extracted a `build_field` function for some code shared between `schema_to_fb` and `schema_to_fb_offset` that needed to change - Uncommented the dictionary field from the Arrow schema roundtrip test and add a dictionary field to the IPC roundtrip test - If a field is a dictionary field, call `add_dictionary` with the dictionary field information on the flatbuffer field, building the dictionary as [the C++ code does][cpp-dictionary] and describe with the same comment - When getting the field type for a dictionary field, use the `value_type` as [the C++ code does][cpp-value-type] and describe with the same comment The tests pass because the Parquet -> Arrow conversion for dictionaries is [already supported][parquet-to-arrow]. [cpp-dictionary]: https://github.com/apache/arrow/blob/477c1021ac013f22389baf9154fb9ad0cf814bec/cpp/src/arrow/ipc/metadata_internal.cc#L426-L440 [cpp-value-type]: https://github.com/apache/arrow/blob/477c1021ac013f22389baf9154fb9ad0cf814bec/cpp/src/arrow/ipc/metadata_internal.cc#L662-L667 [parquet-to-arrow]: https://github.com/apache/arrow/blob/477c1021ac013f22389baf9154fb9ad0cf814bec/rust/arrow/src/ipc/convert.rs#L120-L127 Closes #8291 from carols10cents/rust-parquet-arrow-writer Authored-by: Carol (Nichols || Goulding) <[email protected]> Signed-off-by: Neville Dipale <[email protected]>
In this commit, I: - Extracted a `build_field` function for some code shared between `schema_to_fb` and `schema_to_fb_offset` that needed to change - Uncommented the dictionary field from the Arrow schema roundtrip test and add a dictionary field to the IPC roundtrip test - If a field is a dictionary field, call `add_dictionary` with the dictionary field information on the flatbuffer field, building the dictionary as [the C++ code does][cpp-dictionary] and describe with the same comment - When getting the field type for a dictionary field, use the `value_type` as [the C++ code does][cpp-value-type] and describe with the same comment The tests pass because the Parquet -> Arrow conversion for dictionaries is [already supported][parquet-to-arrow]. [cpp-dictionary]: https://github.com/apache/arrow/blob/477c1021ac013f22389baf9154fb9ad0cf814bec/cpp/src/arrow/ipc/metadata_internal.cc#L426-L440 [cpp-value-type]: https://github.com/apache/arrow/blob/477c1021ac013f22389baf9154fb9ad0cf814bec/cpp/src/arrow/ipc/metadata_internal.cc#L662-L667 [parquet-to-arrow]: https://github.com/apache/arrow/blob/477c1021ac013f22389baf9154fb9ad0cf814bec/rust/arrow/src/ipc/convert.rs#L120-L127 Closes #8291 from carols10cents/rust-parquet-arrow-writer Authored-by: Carol (Nichols || Goulding) <[email protected]> Signed-off-by: Neville Dipale <[email protected]>
In this commit, I: - Extracted a `build_field` function for some code shared between `schema_to_fb` and `schema_to_fb_offset` that needed to change - Uncommented the dictionary field from the Arrow schema roundtrip test and add a dictionary field to the IPC roundtrip test - If a field is a dictionary field, call `add_dictionary` with the dictionary field information on the flatbuffer field, building the dictionary as [the C++ code does][cpp-dictionary] and describe with the same comment - When getting the field type for a dictionary field, use the `value_type` as [the C++ code does][cpp-value-type] and describe with the same comment The tests pass because the Parquet -> Arrow conversion for dictionaries is [already supported][parquet-to-arrow]. [cpp-dictionary]: https://github.com/apache/arrow/blob/477c1021ac013f22389baf9154fb9ad0cf814bec/cpp/src/arrow/ipc/metadata_internal.cc#L426-L440 [cpp-value-type]: https://github.com/apache/arrow/blob/477c1021ac013f22389baf9154fb9ad0cf814bec/cpp/src/arrow/ipc/metadata_internal.cc#L662-L667 [parquet-to-arrow]: https://github.com/apache/arrow/blob/477c1021ac013f22389baf9154fb9ad0cf814bec/rust/arrow/src/ipc/convert.rs#L120-L127 Closes #8291 from carols10cents/rust-parquet-arrow-writer Authored-by: Carol (Nichols || Goulding) <[email protected]> Signed-off-by: Neville Dipale <[email protected]>
In this commit, I: - Extracted a `build_field` function for some code shared between `schema_to_fb` and `schema_to_fb_offset` that needed to change - Uncommented the dictionary field from the Arrow schema roundtrip test and add a dictionary field to the IPC roundtrip test - If a field is a dictionary field, call `add_dictionary` with the dictionary field information on the flatbuffer field, building the dictionary as [the C++ code does][cpp-dictionary] and describe with the same comment - When getting the field type for a dictionary field, use the `value_type` as [the C++ code does][cpp-value-type] and describe with the same comment The tests pass because the Parquet -> Arrow conversion for dictionaries is [already supported][parquet-to-arrow]. [cpp-dictionary]: https://github.com/apache/arrow/blob/477c1021ac013f22389baf9154fb9ad0cf814bec/cpp/src/arrow/ipc/metadata_internal.cc#L426-L440 [cpp-value-type]: https://github.com/apache/arrow/blob/477c1021ac013f22389baf9154fb9ad0cf814bec/cpp/src/arrow/ipc/metadata_internal.cc#L662-L667 [parquet-to-arrow]: https://github.com/apache/arrow/blob/477c1021ac013f22389baf9154fb9ad0cf814bec/rust/arrow/src/ipc/convert.rs#L120-L127 Closes #8291 from carols10cents/rust-parquet-arrow-writer Authored-by: Carol (Nichols || Goulding) <[email protected]> Signed-off-by: Neville Dipale <[email protected]>
In this commit, I: - Extracted a `build_field` function for some code shared between `schema_to_fb` and `schema_to_fb_offset` that needed to change - Uncommented the dictionary field from the Arrow schema roundtrip test and add a dictionary field to the IPC roundtrip test - If a field is a dictionary field, call `add_dictionary` with the dictionary field information on the flatbuffer field, building the dictionary as [the C++ code does][cpp-dictionary] and describe with the same comment - When getting the field type for a dictionary field, use the `value_type` as [the C++ code does][cpp-value-type] and describe with the same comment The tests pass because the Parquet -> Arrow conversion for dictionaries is [already supported][parquet-to-arrow]. [cpp-dictionary]: https://github.com/apache/arrow/blob/477c1021ac013f22389baf9154fb9ad0cf814bec/cpp/src/arrow/ipc/metadata_internal.cc#L426-L440 [cpp-value-type]: https://github.com/apache/arrow/blob/477c1021ac013f22389baf9154fb9ad0cf814bec/cpp/src/arrow/ipc/metadata_internal.cc#L662-L667 [parquet-to-arrow]: https://github.com/apache/arrow/blob/477c1021ac013f22389baf9154fb9ad0cf814bec/rust/arrow/src/ipc/convert.rs#L120-L127 Closes apache#8291 from carols10cents/rust-parquet-arrow-writer Authored-by: Carol (Nichols || Goulding) <[email protected]> Signed-off-by: Neville Dipale <[email protected]>
In this commit, I: - Extracted a `build_field` function for some code shared between `schema_to_fb` and `schema_to_fb_offset` that needed to change - Uncommented the dictionary field from the Arrow schema roundtrip test and add a dictionary field to the IPC roundtrip test - If a field is a dictionary field, call `add_dictionary` with the dictionary field information on the flatbuffer field, building the dictionary as [the C++ code does][cpp-dictionary] and describe with the same comment - When getting the field type for a dictionary field, use the `value_type` as [the C++ code does][cpp-value-type] and describe with the same comment The tests pass because the Parquet -> Arrow conversion for dictionaries is [already supported][parquet-to-arrow]. [cpp-dictionary]: https://github.com/apache/arrow/blob/477c1021ac013f22389baf9154fb9ad0cf814bec/cpp/src/arrow/ipc/metadata_internal.cc#L426-L440 [cpp-value-type]: https://github.com/apache/arrow/blob/477c1021ac013f22389baf9154fb9ad0cf814bec/cpp/src/arrow/ipc/metadata_internal.cc#L662-L667 [parquet-to-arrow]: https://github.com/apache/arrow/blob/477c1021ac013f22389baf9154fb9ad0cf814bec/rust/arrow/src/ipc/convert.rs#L120-L127 Closes #8291 from carols10cents/rust-parquet-arrow-writer Authored-by: Carol (Nichols || Goulding) <[email protected]> Signed-off-by: Neville Dipale <[email protected]>
In this commit, I: - Extracted a `build_field` function for some code shared between `schema_to_fb` and `schema_to_fb_offset` that needed to change - Uncommented the dictionary field from the Arrow schema roundtrip test and add a dictionary field to the IPC roundtrip test - If a field is a dictionary field, call `add_dictionary` with the dictionary field information on the flatbuffer field, building the dictionary as [the C++ code does][cpp-dictionary] and describe with the same comment - When getting the field type for a dictionary field, use the `value_type` as [the C++ code does][cpp-value-type] and describe with the same comment The tests pass because the Parquet -> Arrow conversion for dictionaries is [already supported][parquet-to-arrow]. [cpp-dictionary]: https://github.com/apache/arrow/blob/477c1021ac013f22389baf9154fb9ad0cf814bec/cpp/src/arrow/ipc/metadata_internal.cc#L426-L440 [cpp-value-type]: https://github.com/apache/arrow/blob/477c1021ac013f22389baf9154fb9ad0cf814bec/cpp/src/arrow/ipc/metadata_internal.cc#L662-L667 [parquet-to-arrow]: https://github.com/apache/arrow/blob/477c1021ac013f22389baf9154fb9ad0cf814bec/rust/arrow/src/ipc/convert.rs#L120-L127 Closes #8291 from carols10cents/rust-parquet-arrow-writer Authored-by: Carol (Nichols || Goulding) <[email protected]> Signed-off-by: Neville Dipale <[email protected]>
In this commit, I: - Extracted a `build_field` function for some code shared between `schema_to_fb` and `schema_to_fb_offset` that needed to change - Uncommented the dictionary field from the Arrow schema roundtrip test and add a dictionary field to the IPC roundtrip test - If a field is a dictionary field, call `add_dictionary` with the dictionary field information on the flatbuffer field, building the dictionary as [the C++ code does][cpp-dictionary] and describe with the same comment - When getting the field type for a dictionary field, use the `value_type` as [the C++ code does][cpp-value-type] and describe with the same comment The tests pass because the Parquet -> Arrow conversion for dictionaries is [already supported][parquet-to-arrow]. [cpp-dictionary]: https://github.com/apache/arrow/blob/477c1021ac013f22389baf9154fb9ad0cf814bec/cpp/src/arrow/ipc/metadata_internal.cc#L426-L440 [cpp-value-type]: https://github.com/apache/arrow/blob/477c1021ac013f22389baf9154fb9ad0cf814bec/cpp/src/arrow/ipc/metadata_internal.cc#L662-L667 [parquet-to-arrow]: https://github.com/apache/arrow/blob/477c1021ac013f22389baf9154fb9ad0cf814bec/rust/arrow/src/ipc/convert.rs#L120-L127 Closes #8291 from carols10cents/rust-parquet-arrow-writer Authored-by: Carol (Nichols || Goulding) <[email protected]> Signed-off-by: Neville Dipale <[email protected]>
In this commit, I:
Extracted a
build_fieldfunction for some code shared betweenschema_to_fbandschema_to_fb_offsetthat needed to changeUncommented the dictionary field from the Arrow schema roundtrip test
and add a dictionary field to the IPC roundtrip test
If a field is a dictionary field, call
add_dictionarywith thedictionary field information on the flatbuffer field, building the
dictionary as the C++ code does and describe with the
same comment
When getting the field type for a dictionary field, use the
value_typeas the C++ code does and describe with the same
comment
The tests pass because the Parquet -> Arrow conversion for dictionaries
is already supported.