ARROW-10299: [Rust] Use IPC Metadata V5 as default #9122
@carols10cents FYI @andygrove here's the 3.0.0 blocker
Codecov Report
@@ Coverage Diff @@
## master #9122 +/- ##
==========================================
+ Coverage 81.64% 81.74% +0.09%
==========================================
Files 215 215
Lines 52489 52572 +83
==========================================
+ Hits 42857 42975 +118
+ Misses 9632 9597 -35
Continue to review full report at Codecov.
rust/arrow/src/array/array_struct.rs
Outdated
/// Return child array whose field name equals to column_name
///
/// Note: The Arrow specification allows for duplicate field names, and in such
I couldn't find anything about duplicate field names being allowed in https://github.com/apache/arrow/blob/master/docs/source/format/Columnar.rst#serialization-and-interprocess-communication-ipc
There's a duplicate_field_name test, which fails because we always get the first field with a given name. Perhaps this hasn't been codified into the spec; I'll check.
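To illustrate why a lookup by name can hide a duplicate, here is a minimal sketch using only the standard library. `NamedColumns` and both methods are hypothetical stand-ins for illustration, not the arrow crate's `StructArray` API; the sketch only assumes that lookup scans the fields in order and stops at the first match.

```rust
/// Hypothetical stand-in for a struct array's (field name, child column) pairs.
/// This is NOT the arrow crate's `StructArray`; it only models name lookup.
struct NamedColumns {
    columns: Vec<(String, Vec<i32>)>,
}

impl NamedColumns {
    /// "First match wins": with duplicate names, later columns are unreachable.
    fn column_by_name(&self, name: &str) -> Option<&Vec<i32>> {
        self.columns
            .iter()
            .find(|(field_name, _)| field_name.as_str() == name)
            .map(|(_, values)| values)
    }

    /// Returns every column with the given name, so duplicates stay reachable.
    fn columns_by_name(&self, name: &str) -> Vec<&Vec<i32>> {
        self.columns
            .iter()
            .filter(|(field_name, _)| field_name.as_str() == name)
            .map(|(_, values)| values)
            .collect()
    }
}
```

With two columns both named `a`, `column_by_name("a")` only ever returns the first one, which would be consistent with the test failure described above; a test that compares every child column would need index-based access instead.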
Allowing duplicate field names is rather strange and ambiguous; perhaps that file is used for error-checking purposes.
From the IPC doc: "A struct is a nested type parameterized by an ordered sequence of types (which can all be distinct), called its fields. Each field must have a UTF8-encoded name, and these field names are part of the type metadata."
The "which can all be distinct" part is a total waste of time, isn't it?
// From this point, there's one or more field in the builder that exists with
// the same name.
if (policy_ == CONFLICT_IGNORE) {
  // The ignore policy is more generous when there's duplicate in the builder.
  return Status::OK();
} else if (policy_ == CONFLICT_ERROR) {
  return Status::Invalid("Duplicate found, policy dictate to treat as an error");
}

/**
 * Policy to determine how to react when duplicate columns are encountered.
 */
public enum ConflictPolicy {
  // Ignore the conflict and append the field. This is the default behaviour
  CONFLICT_APPEND,
  // Keep the existing field and ignore the newer one.
  CONFLICT_IGNORE,
  // Replace the existing field with the newer one.
  CONFLICT_REPLACE,
  // Refuse the new field and error out.
  CONFLICT_ERROR
}
I wasn't aware of this, thanks for finding the references. This could explain why the duplicate_field_names integration tests fail
It is not acceptable to silently append/ignore/replace duplicate fields. Even if we manage to let users configure this as a global behavior, it may not cover all possible duplications, even within a single file: someone may want to append while someone else wants to ignore. So for now, until we find a way to let users define the behavior, I suggest raising an error on duplication.
That sounds like a good idea. Could you please open a JIRA for the work, even if we don't get to complete it in time for 3.0.0?
I also prefer raising an error by default, as that'll make users aware very quickly.
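A rough Rust sketch of what the discussion above points towards: the four policies from the quoted Java enum, with erroring as the default. Everything here (`ConflictPolicy`, `Field`, `add_field`) is a hypothetical illustration using only the standard library, not existing arrow crate API, and the `Field` type is deliberately simplified to a name plus a type label.

```rust
/// Hypothetical Rust counterpart of the quoted Java `ConflictPolicy`.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ConflictPolicy {
    Append,
    Ignore,
    Replace,
    Error,
}

impl Default for ConflictPolicy {
    /// Assumption taken from the thread: fail loudly unless told otherwise.
    fn default() -> Self {
        ConflictPolicy::Error
    }
}

/// Simplified field: a name and a type label (not arrow's `Field`).
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: String,
}

/// Adds `new_field` to `fields`, resolving a name clash according to `policy`.
fn add_field(
    fields: &mut Vec<Field>,
    new_field: Field,
    policy: ConflictPolicy,
) -> Result<(), String> {
    // Index of the first existing field with the same name, if any.
    let existing = fields.iter().position(|f| f.name == new_field.name);
    match (existing, policy) {
        (None, _) | (Some(_), ConflictPolicy::Append) => fields.push(new_field),
        (Some(_), ConflictPolicy::Ignore) => {}
        (Some(i), ConflictPolicy::Replace) => fields[i] = new_field,
        (Some(_), ConflictPolicy::Error) => {
            return Err(format!("duplicate field name: {}", new_field.name));
        }
    }
    Ok(())
}
```

Passing the policy per call rather than globally is one way to address the concern that a single file may need different handling for different duplicates, though whether that fits the crate's design is left to the JIRA.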
We had a similar discussion on DataFusion for RecordBatch a while back. We now just refuse duplicated column names, as it offers a much, much simpler way of dealing with columns via column names.
However, sometimes people have to parse files with equal column names, which means that arrow somehow has to support them on RecordBatch.
mqy
left a comment
Ok, this comment is fine
@nevi-me what's the status here? I assume it would be nice to include it in the release.
@kszucs needs review. I hadn't noticed that I had only added Andy as a reviewer. @alamb @jorgecarleitao @paddyhoran if you have some time, could you please have a look at the PR?
If there is still time, I will review this over the weekend, probably Saturday. This is an area that I am not as familiar with, so I need more time ^_^
jorgecarleitao
left a comment
Thanks a lot @nevi-me for taking this. Looks great. Left a small question, and IMO there is no risk in landing this in 3.0.
Shouldn't the null datatype contain N null values where N is the length of the array?
@nevi-me is this one ready to merge? Should we rebase it against latest master and rerun the checks? Or I can merge it in as is too.
Add tests for different padding lengths
update doc
add null tests
I've rebased, and changed a few comments: V5 will be default in 4.0.0, pointed to the JIRA about duplicate column names.
We have been using the legacy IPC format, which predates v1.0 of the crate. This PR changes to use the latest version, `ipc::MetadataVersion::V5` from v3.0 of the crate. The main change was to change the default `IpcWriteOptions`, and add tests

Closes #9122 from nevi-me/ARROW-10299

Authored-by: Neville Dipale <[email protected]>
Signed-off-by: Jorge C. Leitao <[email protected]>
We have been using the legacy IPC format, which predates v1.0 of the crate. This PR changes to use the latest version, `ipc::MetadataVersion::V5` from v3.0 of the crate.
The main change was to change the default `IpcWriteOptions`, and add tests
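For readers unfamiliar with the difference, the following is a minimal sketch of the message prefix as the Arrow columnar format document describes it: the current encapsulated message format starts with a 32-bit continuation indicator (`0xFFFFFFFF`) followed by a little-endian 32-bit metadata length, while the legacy format has only the length. The padding helper relates to the padding-length tests mentioned in the commits. The function names are hypothetical and use only the standard library; this is not the crate's writer code.

```rust
/// Continuation indicator that precedes the length in the non-legacy format.
const CONTINUATION_MARKER: [u8; 4] = [0xff; 4];

/// Appends the message prefix to `out` and returns how many bytes were written.
/// Hypothetical helper: the legacy format omits the continuation marker.
fn write_message_prefix(out: &mut Vec<u8>, metadata_len: i32, legacy_format: bool) -> usize {
    let start = out.len();
    if !legacy_format {
        out.extend_from_slice(&CONTINUATION_MARKER);
    }
    // The metadata length is stored as a little-endian 32-bit integer.
    out.extend_from_slice(&metadata_len.to_le_bytes());
    out.len() - start
}

/// Number of zero bytes needed to bring `len` up to a multiple of `alignment`.
/// `alignment` must be non-zero.
fn padding_needed(len: usize, alignment: usize) -> usize {
    (alignment - len % alignment) % alignment
}
```

A reader written for one framing will misinterpret the other, since the first four bytes mean different things, which is why the choice of default on the writer side matters for interoperability.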