
Conversation

@nevi-me (Contributor) commented Jan 7, 2021

We have been using the legacy IPC format, which predates v1.0 of the crate. This PR changes to use the latest version, `ipc::MetadataVersion::V5`, as of v3.0 of the crate.

The main change was to update the default `IpcWriteOptions` and add tests.
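For anyone unfamiliar with the writer API, the change amounts to something like the sketch below. This is illustrative rather than the PR diff, and assumes the v3.0 constructors `IpcWriteOptions::try_new(alignment, write_legacy_ipc_format, metadata_version)` and `FileWriter::try_new_with_options`:

    // Sketch only: write a small record batch to an IPC file with V5 metadata.
    // After this PR, the plain `IpcWriteOptions::default()` should be equivalent.
    use std::fs::File;
    use std::sync::Arc;

    use arrow::array::{ArrayRef, Int32Array};
    use arrow::datatypes::{DataType, Field, Schema};
    use arrow::error::Result;
    use arrow::ipc::writer::{FileWriter, IpcWriteOptions};
    use arrow::ipc::MetadataVersion;
    use arrow::record_batch::RecordBatch;

    fn write_v5_file(path: &str) -> Result<()> {
        let schema = Arc::new(Schema::new(vec![Field::new("a", DataType::Int32, false)]));
        let batch = RecordBatch::try_new(
            schema.clone(),
            vec![Arc::new(Int32Array::from(vec![1, 2, 3])) as ArrayRef],
        )?;

        // 8-byte alignment, non-legacy format, metadata version V5.
        let options = IpcWriteOptions::try_new(8, false, MetadataVersion::V5)?;

        let file = File::create(path)?;
        let mut writer = FileWriter::try_new_with_options(file, &schema, options)?;
        writer.write(&batch)?;
        writer.finish()
    }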

@nevi-me (Contributor, Author) commented Jan 7, 2021

@carols10cents FYI

@andygrove here's the 3.0.0 blocker

@nevi-me requested a review from andygrove on January 7, 2021 at 08:38
@github-actions (bot) commented Jan 7, 2021

@codecov-io commented Jan 7, 2021

Codecov Report

Merging #9122 (8ccc5bd) into master (a0e1244) will increase coverage by 0.09%.
The diff coverage is 99.06%.


@@            Coverage Diff             @@
##           master    #9122      +/-   ##
==========================================
+ Coverage   81.64%   81.74%   +0.09%     
==========================================
  Files         215      215              
  Lines       52489    52572      +83     
==========================================
+ Hits        42857    42975     +118     
+ Misses       9632     9597      -35     
| Impacted Files | Coverage Δ |
| --- | --- |
| rust/arrow/src/array/array_struct.rs | 88.43% <ø> (ø) |
| rust/arrow/src/util/integration_util.rs | 69.95% <75.00%> (+3.08%) ⬆️ |
| rust/arrow/src/ipc/reader.rs | 84.56% <100.00%> (+1.41%) ⬆️ |
| rust/arrow/src/ipc/writer.rs | 87.82% <100.00%> (+4.60%) ⬆️ |
| rust/parquet/src/encodings/encoding.rs | 95.43% <0.00%> (+0.19%) ⬆️ |
| rust/arrow/src/ipc/convert.rs | 92.56% <0.00%> (+0.45%) ⬆️ |
| rust/arrow/src/array/null.rs | 92.59% <0.00%> (+1.85%) ⬆️ |
| rust/arrow/src/array/array.rs | 86.40% <0.00%> (+1.94%) ⬆️ |
| ... and 1 more | |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update a0e1244...8ccc5bd.


/// Return child array whose field name equals to column_name
///
/// Note: The Arrow specification allows for duplicate field names, and in such
@nevi-me (Contributor, Author):
There's a duplicate_field_name test, which fails because we always get the first named field; perhaps this hasn't been codified into the spec. I'll check.
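For illustration, the current lookup behaves roughly like the sketch below; this is not the exact arrow source, and `first_column_by_name` is a made-up name:

    // Illustrative only: return the first child whose field name matches,
    // so a second field with the same name can never be reached by name.
    use arrow::array::{ArrayRef, StructArray};

    fn first_column_by_name<'a>(array: &'a StructArray, name: &str) -> Option<&'a ArrayRef> {
        array
            .column_names()
            .iter()
            .position(|&field_name| field_name == name)
            .map(|index| array.column(index))
    }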

(Contributor):
Allowing duplicate field names is rather strange and uncertain; perhaps that file is used for error-checking purposes.

@mqy (Contributor) commented Jan 7, 2021

From the IPC doc: A struct is a nested type parameterized by an ordered sequence of types (which can all be distinct), called its fields. Each field must have a UTF8-encoded name, and these field names are part of the type metadata.

The "which can all be distinct" part is a total waste of time, isn't it?

From C++: https://github.com/apache/arrow/blob/57376d28cf433bed95f19fa44c1e90a780ba54e8/cpp/src/arrow/type.cc

    // From this point, there's one or more field in the builder that exists with
    // the same name.

    if (policy_ == CONFLICT_IGNORE) {
      // The ignore policy is more generous when there's duplicate in the builder.
      return Status::OK();
    } else if (policy_ == CONFLICT_ERROR) {
      return Status::Invalid("Duplicate found, policy dictate to treat as an error");
    }

From Java: https://github.com/apache/arrow/blob/25c736d48dc289f457e74d15d05db65f6d539447/java/vector/src/main/java/org/apache/arrow/vector/complex/AbstractStructVector.java

    /**
     * Policy to determine how to react when duplicate columns are encountered.
     */
    public enum ConflictPolicy {
      // Ignore the conflict and append the field. This is the default behaviour
      CONFLICT_APPEND,
      // Keep the existing field and ignore the newer one.
      CONFLICT_IGNORE,
      // Replace the existing field with the newer one.
      CONFLICT_REPLACE,
      // Refuse the new field and error out.
      CONFLICT_ERROR
    }

@nevi-me (Contributor, Author):
I wasn't aware of this, thanks for finding the references. This could explain why the duplicate_field_names integration tests fail.

(Contributor):
It is not acceptable to silently append/ignore/replace duplicate fields. Even if we managed to let users configure this behavior globally, it may not satisfy all possible duplications even in a single file: someone may want to append while another wants to ignore. So, until we find a way to let users define the behavior, I suggest raising an error on duplication.
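A rough sketch of that "raise an error on duplication" idea, using a hypothetical helper that is not part of this PR or the current crate API:

    // Hypothetical sketch: reject a field list containing duplicate names,
    // mirroring the C++/Java CONFLICT_ERROR policy quoted above.
    use std::collections::HashSet;

    use arrow::datatypes::Field;
    use arrow::error::{ArrowError, Result};

    fn check_duplicate_field_names(fields: &[Field]) -> Result<()> {
        let mut seen = HashSet::new();
        for field in fields {
            if !seen.insert(field.name()) {
                return Err(ArrowError::InvalidArgumentError(format!(
                    "duplicate field name '{}'",
                    field.name()
                )));
            }
        }
        Ok(())
    }

A schema builder or reader could call such a check before constructing a struct, and a configurable policy (like the Java ConflictPolicy above) could later replace the hard error.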

@nevi-me (Contributor, Author):
That sounds like a good idea. May you please open a JIRA for the work, even if we don't get to complete it in time for 3.0.0.

I also prefer raising an error by default, as that'll make users aware very quickly.

(Member):
We had a similar discussion on DataFusion for RecordBatch a while back. We now just refuse duplicate column names, as it offers a much, much simpler way of dealing with columns by name.

However, sometimes people have to parse files with identical column names, which means that arrow somehow has to support them on RecordBatch.

@mqy (Contributor) left a comment

Ok, this comment is fine


@kszucs (Member) commented Jan 12, 2021

@nevi-me what's the status here? I assume it would be nice to include it in the release.

@nevi-me (Contributor, Author) commented Jan 13, 2021

@kszucs it needs review. I hadn't noticed that I had only added Andy as a reviewer.

@alamb @jorgecarleitao @paddyhoran if you have some time, may you please have a look at the PR.
I think it's low risk, and it would be helpful to start supporting V5 by default.

@jorgecarleitao (Member):

If there is still time, I will review this over the weekend, probably Saturday.

This is an area that I am not as familiar with, and thus need more time ^_^

@jorgecarleitao (Member) left a comment

Thanks a lot @nevi-me for taking this. Looks great. I left a small question, and IMO there is no risk in landing this in 3.0.

(Member):

Shouldn't the null datatype contain N null values where N is the length of the array?

@alamb (Contributor) commented Jan 20, 2021

@nevi-me is this one ready to merge? Should we rebase it against latest master and rerun the checks? Or I can merge it in as is too

Commits pushed to the PR:
- Add tests for different padding lengths
- update doc
- add null tests
@nevi-me (Contributor, Author) commented Jan 21, 2021

> @nevi-me is this one ready to merge? Should we rebase it against latest master and rerun the checks? Or I can merge it in as is too

I've rebased and changed a few comments: V5 will be the default in 4.0.0, and I pointed to the JIRA about duplicate column names.

kszucs pushed a commit that referenced this pull request Jan 25, 2021
We have been using the legacy IPC format, which predates v1.0 of the crate. This PR changes to use the latest version, `ipc::MetadataVersion::V5` from v3.0 of the crate.

The main change was to change the default `IpcWriteOptions`, and add tests

Closes #9122 from nevi-me/ARROW-10299

Authored-by: Neville Dipale <[email protected]>
Signed-off-by: Jorge C. Leitao <[email protected]>