Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve comparison of struct fields #1839

Conversation

nl5887
Copy link
Contributor

@nl5887 nl5887 commented Jun 10, 2022

Currently if the order of fields in struct is different, it will fail.
This fix will lookup fields, ignoring the field order and length and
fail if not nullable.

Array data type comparison will use the equals_datatype instead of
comparing enums.

This fix work towards closing of apache/datafusion#2326.

@alamb

@github-actions github-actions bot added the arrow Changes to the arrow crate label Jun 10, 2022
&& a.iter().zip(b).all(|(a, b)| {
a.is_nullable() == b.is_nullable()
&& a.data_type().equals_datatype(b.data_type())
a.iter().all(|a| {
Copy link
Member

@viirya viirya Jun 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is not correct. equals_datatype compares exact data type equality (except for metadata and nested field names). Technically speaking, two structs are equal only if the nested field data types are equal.

In apache/datafusion#2326, seems you want to overcome some errors at column projection in DataFusion. For column projection, it requires two structs to be "compatible", so nested field order can be ignored (because it will be "mapped" later). It is a more loose data type equality than the exact equality of equals_datatype.

So I don't think we should change equals_datatype. For compatible data type change, maybe we should have another method for that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that having another method would be good

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, I've added a method compatible_datatype, and reverted my changes to equal_datatype.

The new method will look if two datatypes are compatible which each other,
ignoring the field order and length and fail if missing and not nullable.

Array data type comparison will use the equals_datatype instead of
comparing the enums itself.
@nl5887 nl5887 force-pushed the feature-ignore-field-order-and-nullable-fields-in-comparison branch from 48f025c to 24872b2 Compare June 10, 2022 17:48
@@ -62,7 +62,7 @@ pub fn concat(arrays: &[&dyn Array]) -> Result<ArrayRef> {

if arrays
.iter()
.any(|array| array.data_type() != arrays[0].data_type())
.any(|array| !array.data_type().equals_datatype(arrays[0].data_type()))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to be reverted too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be changed to compatible_datatype (at least for my own usecase).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Arrays can be concatted as long as they are compatible?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think our concat kernel doesn't map out-of-order nested fields currently. It is why it checks data type with exact equality.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is better to focus on compatible datatype method in this PR. You can definitely propose do compatible concat kernel later.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, reverted the concat change.

@codecov-commenter
Copy link

Codecov Report

Merging #1839 (ed0ed75) into master (8d787d9) will decrease coverage by 0.03%.
The diff coverage is 0.00%.

@@            Coverage Diff             @@
##           master    #1839      +/-   ##
==========================================
- Coverage   83.48%   83.45%   -0.04%     
==========================================
  Files         201      201              
  Lines       56838    56931      +93     
==========================================
+ Hits        47452    47510      +58     
- Misses       9386     9421      +35     
Impacted Files Coverage Δ
arrow/src/datatypes/datatype.rs 59.59% <0.00%> (-6.21%) ⬇️
arrow/src/compute/kernels/temporal.rs 95.77% <0.00%> (-1.36%) ⬇️
parquet_derive/src/parquet_field.rs 65.75% <0.00%> (-0.23%) ⬇️
parquet/src/encodings/encoding.rs 93.46% <0.00%> (-0.20%) ⬇️
arrow/src/array/mod.rs 100.00% <0.00%> (ø)
arrow/src/compute/kernels/substring.rs 99.53% <0.00%> (+0.01%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8d787d9...ed0ed75. Read the comment docs.

@tustvold
Copy link
Contributor

tustvold commented Jun 13, 2022

I'm not sure about this, reordering fields in a StructArray is not compatible? To me this feels like a limitation of the SchemaAdapter logic in DataFusion which needs to:

  • Understand nested types
  • Re-project fields within nested types

Columns and fields are identified by index and not column name, and so they cannot be arbitrarily reordered?

It also possibly relates to apache/datafusion#2581

@tustvold tustvold marked this pull request as draft June 16, 2022 20:21
@tustvold
Copy link
Contributor

This PR has been inactive for a while so closing to clear the backlog, please feel free to reopen if you come back to this

@tustvold tustvold closed this Jan 15, 2023
This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants