
Conversation

@nevi-me (Contributor) commented Jan 7, 2021

We have been using the legacy IPC format, which predates v1.0 of the crate. This PR changes to use the latest version, `ipc::MetadataVersion::V5`, as of v3.0 of the crate.

The main change was to update the default `IpcWriteOptions` and add tests.
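For anyone unfamiliar with the writer API, the change amounts to something like the sketch below. This is illustrative rather than the PR diff, and assumes the v3.0 constructors `IpcWriteOptions::try_new(alignment, write_legacy_ipc_format, metadata_version)` and `FileWriter::try_new_with_options`:

    // Sketch only: write a small record batch to an IPC file with V5 metadata.
    // After this PR, the plain `IpcWriteOptions::default()` should be equivalent.
    use std::fs::File;
    use std::sync::Arc;

    use arrow::array::{ArrayRef, Int32Array};
    use arrow::datatypes::{DataType, Field, Schema};
    use arrow::error::Result;
    use arrow::ipc::writer::{FileWriter, IpcWriteOptions};
    use arrow::ipc::MetadataVersion;
    use arrow::record_batch::RecordBatch;

    fn write_v5_file(path: &str) -> Result<()> {
        let schema = Arc::new(Schema::new(vec![Field::new("a", DataType::Int32, false)]));
        let batch = RecordBatch::try_new(
            schema.clone(),
            vec![Arc::new(Int32Array::from(vec![1, 2, 3])) as ArrayRef],
        )?;

        // 8-byte alignment, non-legacy format, metadata version V5.
        let options = IpcWriteOptions::try_new(8, false, MetadataVersion::V5)?;

        let file = File::create(path)?;
        let mut writer = FileWriter::try_new_with_options(file, &schema, options)?;
        writer.write(&batch)?;
        writer.finish()
    }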

@nevi-me (Contributor, Author) commented Jan 7, 2021

@carols10cents FYI

@andygrove here's the 3.0.0 blocker

@nevi-me requested a review from andygrove on January 7, 2021 at 08:38
@github-actions (bot) commented Jan 7, 2021

@codecov-io commented Jan 7, 2021

Codecov Report

Merging #9122 (8ccc5bd) into master (a0e1244) will increase coverage by 0.09%.
The diff coverage is 99.06%.


@@            Coverage Diff             @@
##           master    #9122      +/-   ##
==========================================
+ Coverage   81.64%   81.74%   +0.09%     
==========================================
  Files         215      215              
  Lines       52489    52572      +83     
==========================================
+ Hits        42857    42975     +118     
+ Misses       9632     9597      -35     
| Impacted Files | Coverage Δ |
| --- | --- |
| rust/arrow/src/array/array_struct.rs | 88.43% <ø> (ø) |
| rust/arrow/src/util/integration_util.rs | 69.95% <75.00%> (+3.08%) ⬆️ |
| rust/arrow/src/ipc/reader.rs | 84.56% <100.00%> (+1.41%) ⬆️ |
| rust/arrow/src/ipc/writer.rs | 87.82% <100.00%> (+4.60%) ⬆️ |
| rust/parquet/src/encodings/encoding.rs | 95.43% <0.00%> (+0.19%) ⬆️ |
| rust/arrow/src/ipc/convert.rs | 92.56% <0.00%> (+0.45%) ⬆️ |
| rust/arrow/src/array/null.rs | 92.59% <0.00%> (+1.85%) ⬆️ |
| rust/arrow/src/array/array.rs | 86.40% <0.00%> (+1.94%) ⬆️ |
| ... and 1 more | |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update a0e1244...8ccc5bd.


/// Return child array whose field name equals to column_name
///
/// Note: The Arrow specification allows for duplicate field names, and in such
@nevi-me (Contributor, Author):
There's a duplicate_field_name test, which fails because we always get the first named field; perhaps this hasn't been codified into the spec. I'll check.
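For illustration, the current lookup behaves roughly like the sketch below; this is not the exact arrow source, and `first_column_by_name` is a made-up name:

    // Illustrative only: return the first child whose field name matches,
    // so a second field with the same name can never be reached by name.
    use arrow::array::{ArrayRef, StructArray};

    fn first_column_by_name<'a>(array: &'a StructArray, name: &str) -> Option<&'a ArrayRef> {
        array
            .column_names()
            .iter()
            .position(|&field_name| field_name == name)
            .map(|index| array.column(index))
    }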

(Contributor):
Allowing duplicate field names is rather strange and uncertain; perhaps that file is used for error-checking purposes.

@mqy (Contributor) commented Jan 7, 2021

From the IPC doc: A struct is a nested type parameterized by an ordered sequence of types (which can all be distinct), called its fields. Each field must have a UTF8-encoded name, and these field names are part of the type metadata.

The "which can all be distinct" part is a total waste of time, isn't it?

From C++: https://github.com/apache/arrow/blob/57376d28cf433bed95f19fa44c1e90a780ba54e8/cpp/src/arrow/type.cc

    // From this point, there's one or more field in the builder that exists with
    // the same name.

    if (policy_ == CONFLICT_IGNORE) {
      // The ignore policy is more generous when there's duplicate in the builder.
      return Status::OK();
    } else if (policy_ == CONFLICT_ERROR) {
      return Status::Invalid("Duplicate found, policy dictate to treat as an error");
    }

From Java: https://github.com/apache/arrow/blob/25c736d48dc289f457e74d15d05db65f6d539447/java/vector/src/main/java/org/apache/arrow/vector/complex/AbstractStructVector.java

    /**
     * Policy to determine how to react when duplicate columns are encountered.
     */
    public enum ConflictPolicy {
      // Ignore the conflict and append the field. This is the default behaviour
      CONFLICT_APPEND,
      // Keep the existing field and ignore the newer one.
      CONFLICT_IGNORE,
      // Replace the existing field with the newer one.
      CONFLICT_REPLACE,
      // Refuse the new field and error out.
      CONFLICT_ERROR
    }

@nevi-me (Contributor, Author):
I wasn't aware of this, thanks for finding the references. This could explain why the duplicate_field_names integration tests fail.

(Contributor):
It is not acceptable to silently append/ignore/replace duplicate fields. Even if we managed to let users configure this behavior globally, it may not satisfy all possible duplications even in a single file: someone may want to append while another wants to ignore. So, until we find a way to let users define the behavior, I suggest raising an error on duplication.
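A rough sketch of that "raise an error on duplication" idea, using a hypothetical helper that is not part of this PR or the current crate API:

    // Hypothetical sketch: reject a field list containing duplicate names,
    // mirroring the C++/Java CONFLICT_ERROR policy quoted above.
    use std::collections::HashSet;

    use arrow::datatypes::Field;
    use arrow::error::{ArrowError, Result};

    fn check_duplicate_field_names(fields: &[Field]) -> Result<()> {
        let mut seen = HashSet::new();
        for field in fields {
            if !seen.insert(field.name()) {
                return Err(ArrowError::InvalidArgumentError(format!(
                    "duplicate field name '{}'",
                    field.name()
                )));
            }
        }
        Ok(())
    }

A schema builder or reader could call such a check before constructing a struct, and a configurable policy (like the Java ConflictPolicy above) could later replace the hard error.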

@nevi-me (Contributor, Author):
That sounds like a good idea. May you please open a JIRA for the work, even if we don't get to complete it in time for 3.0.0.

I also prefer raising an error by default, as that'll make users aware very quickly.

(Member):
We had a similar discussion on DataFusion for RecordBatch a while back. We now just refuse duplicate column names, as it offers a much, much simpler way of dealing with columns by name.

However, sometimes people have to parse files with identical column names, which means that arrow somehow has to support them on RecordBatch.

@mqy (Contributor) left a comment

Ok, this comment is fine


@kszucs (Member) commented Jan 12, 2021

@nevi-me what's the status here? I assume it would be nice to include it in the release.

@nevi-me (Contributor, Author) commented Jan 13, 2021

@kszucs it needs review. I hadn't noticed that I had only added Andy as a reviewer.

@alamb @jorgecarleitao @paddyhoran if you have some time, may you please have a look at the PR.
I think it's low risk, and it would be helpful to start supporting V5 by default.

@jorgecarleitao (Member):

If there is still time, I will review this over the weekend, probably Saturday.

This is an area that I am not as familiar with, and thus need more time ^_^

@jorgecarleitao (Member) left a comment

Thanks a lot @nevi-me for taking this. Looks great. I left a small question, and IMO there is no risk in landing this in 3.0.

(Member):

Shouldn't the null datatype contain N null values where N is the length of the array?

@alamb (Contributor) commented Jan 20, 2021

@nevi-me is this one ready to merge? Should we rebase it against latest master and rerun the checks? Or I can merge it in as is too

Commits pushed to the PR:
- Add tests for different padding lengths
- update doc
- add null tests
@nevi-me (Contributor, Author) commented Jan 21, 2021

> @nevi-me is this one ready to merge? Should we rebase it against latest master and rerun the checks? Or I can merge it in as is too

I've rebased and changed a few comments: V5 will be the default in 4.0.0, and I pointed to the JIRA about duplicate column names.

kszucs pushed a commit that referenced this pull request Jan 25, 2021
We have been using the legacy IPC format, which predates v1.0 of the crate. This PR changes to use the latest version, `ipc::MetadataVersion::V5` from v3.0 of the crate.

The main change was to change the default `IpcWriteOptions`, and add tests

Closes #9122 from nevi-me/ARROW-10299

Authored-by: Neville Dipale <[email protected]>
Signed-off-by: Jorge C. Leitao <[email protected]>