Skip to content

Conversation

@carols10cents
Copy link
Contributor

In this commit, I:

  • Extracted a build_field function for some code shared between
    schema_to_fb and schema_to_fb_offset that needed to change

  • Uncommented the dictionary field from the Arrow schema roundtrip test
    and add a dictionary field to the IPC roundtrip test

  • If a field is a dictionary field, call add_dictionary with the
    dictionary field information on the flatbuffer field, building the
    dictionary as the C++ code does and describe with the
    same comment

  • When getting the field type for a dictionary field, use the value_type
    as the C++ code does and describe with the same
    comment

The tests pass because the Parquet -> Arrow conversion for dictionaries
is already supported.

In this commit, I:

- Extracted a `build_field` function for some code shared between
`schema_to_fb` and `schema_to_fb_offset` that needed to change

- Uncommented the dictionary field from the Arrow schema roundtrip test
and add a dictionary field to the IPC roundtrip test

- If a field is a dictionary field, call `add_dictionary` with the
dictionary field information on the flatbuffer field, building the
dictionary as [the C++ code does][cpp-dictionary] and describe with the
same comment

- When getting the field type for a dictionary field, use the `value_type`
as [the C++ code does][cpp-value-type] and describe with the same
comment

The tests pass because the Parquet -> Arrow conversion for dictionaries
is [already supported][parquet-to-arrow].

[cpp-dictionary]: https://github.com/apache/arrow/blob/477c1021ac013f22389baf9154fb9ad0cf814bec/cpp/src/arrow/ipc/metadata_internal.cc#L426-L440
[cpp-value-type]: https://github.com/apache/arrow/blob/477c1021ac013f22389baf9154fb9ad0cf814bec/cpp/src/arrow/ipc/metadata_internal.cc#L662-L667
[parquet-to-arrow]: https://github.com/apache/arrow/blob/477c1021ac013f22389baf9154fb9ad0cf814bec/rust/arrow/src/ipc/convert.rs#L120-L127
@github-actions
Copy link

@carols10cents
Copy link
Contributor Author

Aaaand I see now that the ticket includes writing the dictionary data, while this PR only writes the dictionary schema. Well, this is part of the work needed anyway... getting to the rest of it now 🤦‍♀️

Copy link
Contributor

@nevi-me nevi-me left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @carols10cents, I have a comment re dictionary signed support.

From looking at the changes, they're related to Arrow's IPC file/stream format, and Parquet; which are independent. It's not an issue as we're getting to progress on both on this PR :)

We might not be merging the changes on this branch into the main branch in time for 2.0.0 (est October 9), so if a feature that we work on ends up touching IPC code in immediately beneficial ways (we still have a lot to cover), we could merge it against the main branch, then rebase this parquet one into it.

// validated elsewhere, and can safely assume we are dealing with signed
// integers
let mut index_builder = ipc::IntBuilder::new(fbb);
index_builder.add_is_signed(true);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although signed indices are preferred, we do support unsigned dictionary indices in the data types, so we might need to populate this based on the index_type

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@carols10cents we can address this in future, I'm merging this

@carols10cents
Copy link
Contributor Author

Oops, I... thought I knew what I was doing a little bit, and I clearly don't at all 😅

I think I went down this path because rust/parquet/src/arrow/schema.rs had the commented-out dictionary field that was added to this branch in this test:

#[test]
    fn test_arrow_schema_roundtrip() -> Result<()> {
        // This tests the roundtrip of an Arrow schema
        // Fields that are commented out fail roundtrip tests or are unsupported by the writer

and I thought "the writer" in this comment and exercised by this test meant the Arrow -> Parquet writer. But it's the changes in rust/arrow/src/ipc/convert.rs that get this test to pass, so I'm not sure where I got turned around...

Am I at least correct that adding dictionary support will involve adding support for the dictionary schema as well as the dictionary data? And that neither of those are done yet?

I'm also unclear about what I should do with this PR based on your comment; should I close it for now or fix the signed index type issue and reopen it against master?

@carols10cents
Copy link
Contributor Author

Am I at least correct that adding dictionary support will involve adding support for the dictionary schema as well as the dictionary data?

Ok well I just found https://github.com/apache/arrow/compare/rust-parquet-arrow-writer#diff-b1d078f4cb4841c220e8a274543ef616R356-R360 which seems to indicate no, there isn't actually schema conversion that needs to be done :(

@carols10cents
Copy link
Contributor Author

@nevi-me Regarding

We might not be merging the changes on this branch into the main branch in time for 2.0.0 (est October 9)

How can I best help with getting the rust-parquet-arrow-writer branch ready to be merged into master? I see you're concerned about test coverage, and I think writing more tests would help me to understand all the things I clearly don't 😅 What parts need test coverage the most, in your opinion?

@nevi-me
Copy link
Contributor

nevi-me commented Oct 3, 2020

Hi @carols10cents, my apologies for the delay.

Am I at least correct that adding dictionary support will involve adding support for the dictionary schema as well as the dictionary data? And that neither of those are done yet?

The Parquet format's dictionary semantics are in my understanding different to Arrow, in that we could have a StringArray in Arrow that gets saved with Parquet's dictionary encoding. I'm still going through the rest of the Rust Parquet codebase though, so I'm not speaking from any authority.

My comment was more in passing, and generally around dictionary support in IPC, in that beyond serializing an arrow::datatypes::Schema to arrow::ipc::Schema, there might be little in relation between the file formats.

@nevi-me
Copy link
Contributor

nevi-me commented Oct 3, 2020

How can I best help with getting the rust-parquet-arrow-writer branch ready to be merged into master?

The main blocker right now is computing nested definition and repetition levels for deeply nested arrays (see https://github.com/apache/arrow/tree/master/cpp/src/parquet/arrow for the CPP implementation). I've been working on that on and off, due to time constraints (and the need to spend a good chunk of time per session; which I haven't had much of).

Perhaps I can commit my work so you can have a look at that. I'll also open more JIRAs for work I believe we'd need to cover.

Writing more test cases will also def help us.

@nevi-me
Copy link
Contributor

nevi-me commented Oct 3, 2020

Merged

@nevi-me nevi-me closed this Oct 3, 2020
nevi-me pushed a commit that referenced this pull request Oct 3, 2020
In this commit, I:

- Extracted a `build_field` function for some code shared between
`schema_to_fb` and `schema_to_fb_offset` that needed to change

- Uncommented the dictionary field from the Arrow schema roundtrip test
and add a dictionary field to the IPC roundtrip test

- If a field is a dictionary field, call `add_dictionary` with the
dictionary field information on the flatbuffer field, building the
dictionary as [the C++ code does][cpp-dictionary] and describe with the
same comment

- When getting the field type for a dictionary field, use the `value_type`
as [the C++ code does][cpp-value-type] and describe with the same
comment

The tests pass because the Parquet -> Arrow conversion for dictionaries
is [already supported][parquet-to-arrow].

[cpp-dictionary]: https://github.com/apache/arrow/blob/477c1021ac013f22389baf9154fb9ad0cf814bec/cpp/src/arrow/ipc/metadata_internal.cc#L426-L440
[cpp-value-type]: https://github.com/apache/arrow/blob/477c1021ac013f22389baf9154fb9ad0cf814bec/cpp/src/arrow/ipc/metadata_internal.cc#L662-L667
[parquet-to-arrow]: https://github.com/apache/arrow/blob/477c1021ac013f22389baf9154fb9ad0cf814bec/rust/arrow/src/ipc/convert.rs#L120-L127

Closes #8291 from carols10cents/rust-parquet-arrow-writer

Authored-by: Carol (Nichols || Goulding) <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
@carols10cents carols10cents deleted the rust-parquet-arrow-writer branch October 5, 2020 12:22
nevi-me pushed a commit that referenced this pull request Oct 7, 2020
In this commit, I:

- Extracted a `build_field` function for some code shared between
`schema_to_fb` and `schema_to_fb_offset` that needed to change

- Uncommented the dictionary field from the Arrow schema roundtrip test
and add a dictionary field to the IPC roundtrip test

- If a field is a dictionary field, call `add_dictionary` with the
dictionary field information on the flatbuffer field, building the
dictionary as [the C++ code does][cpp-dictionary] and describe with the
same comment

- When getting the field type for a dictionary field, use the `value_type`
as [the C++ code does][cpp-value-type] and describe with the same
comment

The tests pass because the Parquet -> Arrow conversion for dictionaries
is [already supported][parquet-to-arrow].

[cpp-dictionary]: https://github.com/apache/arrow/blob/477c1021ac013f22389baf9154fb9ad0cf814bec/cpp/src/arrow/ipc/metadata_internal.cc#L426-L440
[cpp-value-type]: https://github.com/apache/arrow/blob/477c1021ac013f22389baf9154fb9ad0cf814bec/cpp/src/arrow/ipc/metadata_internal.cc#L662-L667
[parquet-to-arrow]: https://github.com/apache/arrow/blob/477c1021ac013f22389baf9154fb9ad0cf814bec/rust/arrow/src/ipc/convert.rs#L120-L127

Closes #8291 from carols10cents/rust-parquet-arrow-writer

Authored-by: Carol (Nichols || Goulding) <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
nevi-me pushed a commit that referenced this pull request Oct 12, 2020
In this commit, I:

- Extracted a `build_field` function for some code shared between
`schema_to_fb` and `schema_to_fb_offset` that needed to change

- Uncommented the dictionary field from the Arrow schema roundtrip test
and add a dictionary field to the IPC roundtrip test

- If a field is a dictionary field, call `add_dictionary` with the
dictionary field information on the flatbuffer field, building the
dictionary as [the C++ code does][cpp-dictionary] and describe with the
same comment

- When getting the field type for a dictionary field, use the `value_type`
as [the C++ code does][cpp-value-type] and describe with the same
comment

The tests pass because the Parquet -> Arrow conversion for dictionaries
is [already supported][parquet-to-arrow].

[cpp-dictionary]: https://github.com/apache/arrow/blob/477c1021ac013f22389baf9154fb9ad0cf814bec/cpp/src/arrow/ipc/metadata_internal.cc#L426-L440
[cpp-value-type]: https://github.com/apache/arrow/blob/477c1021ac013f22389baf9154fb9ad0cf814bec/cpp/src/arrow/ipc/metadata_internal.cc#L662-L667
[parquet-to-arrow]: https://github.com/apache/arrow/blob/477c1021ac013f22389baf9154fb9ad0cf814bec/rust/arrow/src/ipc/convert.rs#L120-L127

Closes #8291 from carols10cents/rust-parquet-arrow-writer

Authored-by: Carol (Nichols || Goulding) <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
nevi-me pushed a commit that referenced this pull request Oct 16, 2020
In this commit, I:

- Extracted a `build_field` function for some code shared between
`schema_to_fb` and `schema_to_fb_offset` that needed to change

- Uncommented the dictionary field from the Arrow schema roundtrip test
and add a dictionary field to the IPC roundtrip test

- If a field is a dictionary field, call `add_dictionary` with the
dictionary field information on the flatbuffer field, building the
dictionary as [the C++ code does][cpp-dictionary] and describe with the
same comment

- When getting the field type for a dictionary field, use the `value_type`
as [the C++ code does][cpp-value-type] and describe with the same
comment

The tests pass because the Parquet -> Arrow conversion for dictionaries
is [already supported][parquet-to-arrow].

[cpp-dictionary]: https://github.com/apache/arrow/blob/477c1021ac013f22389baf9154fb9ad0cf814bec/cpp/src/arrow/ipc/metadata_internal.cc#L426-L440
[cpp-value-type]: https://github.com/apache/arrow/blob/477c1021ac013f22389baf9154fb9ad0cf814bec/cpp/src/arrow/ipc/metadata_internal.cc#L662-L667
[parquet-to-arrow]: https://github.com/apache/arrow/blob/477c1021ac013f22389baf9154fb9ad0cf814bec/rust/arrow/src/ipc/convert.rs#L120-L127

Closes #8291 from carols10cents/rust-parquet-arrow-writer

Authored-by: Carol (Nichols || Goulding) <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
nevi-me pushed a commit to nevi-me/arrow that referenced this pull request Oct 17, 2020
In this commit, I:

- Extracted a `build_field` function for some code shared between
`schema_to_fb` and `schema_to_fb_offset` that needed to change

- Uncommented the dictionary field from the Arrow schema roundtrip test
and add a dictionary field to the IPC roundtrip test

- If a field is a dictionary field, call `add_dictionary` with the
dictionary field information on the flatbuffer field, building the
dictionary as [the C++ code does][cpp-dictionary] and describe with the
same comment

- When getting the field type for a dictionary field, use the `value_type`
as [the C++ code does][cpp-value-type] and describe with the same
comment

The tests pass because the Parquet -> Arrow conversion for dictionaries
is [already supported][parquet-to-arrow].

[cpp-dictionary]: https://github.com/apache/arrow/blob/477c1021ac013f22389baf9154fb9ad0cf814bec/cpp/src/arrow/ipc/metadata_internal.cc#L426-L440
[cpp-value-type]: https://github.com/apache/arrow/blob/477c1021ac013f22389baf9154fb9ad0cf814bec/cpp/src/arrow/ipc/metadata_internal.cc#L662-L667
[parquet-to-arrow]: https://github.com/apache/arrow/blob/477c1021ac013f22389baf9154fb9ad0cf814bec/rust/arrow/src/ipc/convert.rs#L120-L127

Closes apache#8291 from carols10cents/rust-parquet-arrow-writer

Authored-by: Carol (Nichols || Goulding) <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
nevi-me pushed a commit that referenced this pull request Oct 25, 2020
In this commit, I:

- Extracted a `build_field` function for some code shared between
`schema_to_fb` and `schema_to_fb_offset` that needed to change

- Uncommented the dictionary field from the Arrow schema roundtrip test
and add a dictionary field to the IPC roundtrip test

- If a field is a dictionary field, call `add_dictionary` with the
dictionary field information on the flatbuffer field, building the
dictionary as [the C++ code does][cpp-dictionary] and describe with the
same comment

- When getting the field type for a dictionary field, use the `value_type`
as [the C++ code does][cpp-value-type] and describe with the same
comment

The tests pass because the Parquet -> Arrow conversion for dictionaries
is [already supported][parquet-to-arrow].

[cpp-dictionary]: https://github.com/apache/arrow/blob/477c1021ac013f22389baf9154fb9ad0cf814bec/cpp/src/arrow/ipc/metadata_internal.cc#L426-L440
[cpp-value-type]: https://github.com/apache/arrow/blob/477c1021ac013f22389baf9154fb9ad0cf814bec/cpp/src/arrow/ipc/metadata_internal.cc#L662-L667
[parquet-to-arrow]: https://github.com/apache/arrow/blob/477c1021ac013f22389baf9154fb9ad0cf814bec/rust/arrow/src/ipc/convert.rs#L120-L127

Closes #8291 from carols10cents/rust-parquet-arrow-writer

Authored-by: Carol (Nichols || Goulding) <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
nevi-me pushed a commit that referenced this pull request Oct 27, 2020
In this commit, I:

- Extracted a `build_field` function for some code shared between
`schema_to_fb` and `schema_to_fb_offset` that needed to change

- Uncommented the dictionary field from the Arrow schema roundtrip test
and add a dictionary field to the IPC roundtrip test

- If a field is a dictionary field, call `add_dictionary` with the
dictionary field information on the flatbuffer field, building the
dictionary as [the C++ code does][cpp-dictionary] and describe with the
same comment

- When getting the field type for a dictionary field, use the `value_type`
as [the C++ code does][cpp-value-type] and describe with the same
comment

The tests pass because the Parquet -> Arrow conversion for dictionaries
is [already supported][parquet-to-arrow].

[cpp-dictionary]: https://github.com/apache/arrow/blob/477c1021ac013f22389baf9154fb9ad0cf814bec/cpp/src/arrow/ipc/metadata_internal.cc#L426-L440
[cpp-value-type]: https://github.com/apache/arrow/blob/477c1021ac013f22389baf9154fb9ad0cf814bec/cpp/src/arrow/ipc/metadata_internal.cc#L662-L667
[parquet-to-arrow]: https://github.com/apache/arrow/blob/477c1021ac013f22389baf9154fb9ad0cf814bec/rust/arrow/src/ipc/convert.rs#L120-L127

Closes #8291 from carols10cents/rust-parquet-arrow-writer

Authored-by: Carol (Nichols || Goulding) <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
nevi-me pushed a commit that referenced this pull request Oct 28, 2020
In this commit, I:

- Extracted a `build_field` function for some code shared between
`schema_to_fb` and `schema_to_fb_offset` that needed to change

- Uncommented the dictionary field from the Arrow schema roundtrip test
and add a dictionary field to the IPC roundtrip test

- If a field is a dictionary field, call `add_dictionary` with the
dictionary field information on the flatbuffer field, building the
dictionary as [the C++ code does][cpp-dictionary] and describe with the
same comment

- When getting the field type for a dictionary field, use the `value_type`
as [the C++ code does][cpp-value-type] and describe with the same
comment

The tests pass because the Parquet -> Arrow conversion for dictionaries
is [already supported][parquet-to-arrow].

[cpp-dictionary]: https://github.com/apache/arrow/blob/477c1021ac013f22389baf9154fb9ad0cf814bec/cpp/src/arrow/ipc/metadata_internal.cc#L426-L440
[cpp-value-type]: https://github.com/apache/arrow/blob/477c1021ac013f22389baf9154fb9ad0cf814bec/cpp/src/arrow/ipc/metadata_internal.cc#L662-L667
[parquet-to-arrow]: https://github.com/apache/arrow/blob/477c1021ac013f22389baf9154fb9ad0cf814bec/rust/arrow/src/ipc/convert.rs#L120-L127

Closes #8291 from carols10cents/rust-parquet-arrow-writer

Authored-by: Carol (Nichols || Goulding) <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants