ARROW-10402: [Rust] Refactor array equality #8541
I have simplified this code further to allow the compiler to optimize out some functions. The code is now 10-40% faster compared to current master. I updated the description accordingly.
Something that we can address later is that if a struct slot is null, we should carry that nullness across to its children. This is actually the problem that @carols10cents and I encountered on both parquet roundtrips and integration tests.
The spec (http://arrow.apache.org/docs/format/Columnar.html#struct-layout) says:
While a struct does not have physical storage for each of its semantic slots (i.e. each scalar C-like struct), an entire struct slot can be set to null via the validity bitmap. Any of the child field arrays can have null values according to their respective independent validity bitmaps. This implies that for a particular struct slot the validity bitmap for the struct array might indicate a null slot when one or more of its child arrays has a non-null value in their corresponding slot. When reading the struct array the parent validity bitmap takes priority.
I think we can address this in a separate PR, if we still find that struct_equal doesn't cover that edge case.
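To make the edge case concrete, here is a minimal sketch of "parent validity takes priority" for a struct with a single `i32` child. `StructColumn`, `slot` and `struct_logically_equal` are hypothetical names used only for illustration; they are not the arrow crate's types, and the real `struct_equal` may be organized differently.

```rust
/// Hypothetical, simplified stand-in for a struct array with one `i32` child.
/// This is NOT the arrow crate's `StructArray`.
struct StructColumn {
    /// Validity of each struct slot (true = valid).
    validity: Vec<bool>,
    /// Validity of each child slot, independent of the parent.
    child_validity: Vec<bool>,
    /// Child values; the contents under a null slot are arbitrary.
    child_values: Vec<i32>,
}

impl StructColumn {
    /// Logical value of slot `i`.
    /// `None` means the struct slot itself is null (the parent bitmap wins),
    /// `Some(None)` means the struct is valid but its child is null.
    fn slot(&self, i: usize) -> Option<Option<i32>> {
        if !self.validity[i] {
            return None;
        }
        Some(if self.child_validity[i] {
            Some(self.child_values[i])
        } else {
            None
        })
    }
}

/// Compares two columns slot by slot, ignoring whatever the children
/// hold underneath a null struct slot.
fn struct_logically_equal(lhs: &StructColumn, rhs: &StructColumn) -> bool {
    lhs.validity.len() == rhs.validity.len()
        && (0..lhs.validity.len()).all(|i| lhs.slot(i) == rhs.slot(i))
}
```

Under this reading, two struct arrays whose children disagree only beneath a null parent slot compare as equal.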
nevi-me
left a comment
Thanks @jorgecarleitao
I'm happy that this PR's been updated for logical equality, and I'd propose we merge it in, then look at #8590 and #8200 to address edge-cases where the IPC and Parquet tests still fail equality tests.
@carols10cents @alamb @paddyhoran @jhorstmann what's your opinion?
alamb
left a comment
I can't say I made it through this entire PR carefully, but I reviewed about 50% of it carefully and skimmed the rest -- what I did see had a logical and understandable structure I would feel comfortable supporting / maintaining, FWIW.
I agree with @jorgecarleitao that having a uniform definition of array equality will solve many of the challenges we have seen (especially with parquet <--> arrow round tripping)
Thus my opinion is that we should merge this.
rust/arrow/src/array/array.rs
Outdated
:heart:
rust/arrow/src/array/builder.rs
Outdated
this comment may need to be updated as the code seems to no longer account for offset
rust/arrow/src/array/builder.rs
Outdated
this is certainly a lot nicer
rust/arrow/src/array/builder.rs
Outdated
not that you did it, but I wonder why this code is commented out -- maybe your changes will have fixed it
rust/arrow/src/array/data.rs
Outdated
👍 for the panic
rust/arrow/src/array/data.rs
Outdated
Suggested change:
- /// Returns the buffer `buffer` as a slice of type `T`. The slice is already offset.
+ /// Returns self.buffers at index `buffer` as a slice of type `T` starting at self.offset
rust/arrow/src/array/equal/mod.rs
Outdated
Suggested change:
- fn equal_values(
+ /// Compares two arrays for equality starting at `lhs_start` and `rhs_start`
+ /// `lhs` and `rhs` *must* have the same data type
+ fn equal_values(
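As a rough illustration of what a range comparison with separate start positions could look like, here is a self-contained sketch. `Column` and `equal_range` are hypothetical stand-ins, not the signatures used in this PR, and the explicit `len` parameter is an assumption of the sketch.

```rust
/// Hypothetical simplified column: values plus a per-slot validity flag.
struct Column {
    values: Vec<i64>,
    validity: Vec<bool>,
}

/// Compares `len` slots of `lhs` starting at `lhs_start` with `len` slots of
/// `rhs` starting at `rhs_start`. Both sides share one data type here because
/// `Column` only holds `i64`; a dynamically typed version would have to check.
fn equal_range(
    lhs: &Column,
    rhs: &Column,
    lhs_start: usize,
    rhs_start: usize,
    len: usize,
) -> bool {
    // A range that runs past either column cannot be equal.
    if lhs_start + len > lhs.values.len() || rhs_start + len > rhs.values.len() {
        return false;
    }
    (0..len).all(|i| {
        let (l, r) = (lhs_start + i, rhs_start + i);
        match (lhs.validity[l], rhs.validity[r]) {
            // Two nulls are equal regardless of the bytes underneath.
            (false, false) => true,
            // Two valid slots are equal when their values match.
            (true, true) => lhs.values[l] == rhs.values[r],
            // A null never equals a valid value.
            _ => false,
        }
    })
}
```

Taking start positions as parameters lets sliced (offset) arrays be compared without copying their buffers.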
rust/arrow/src/array/data.rs
Outdated
Thanks a lot, I think I have another ticket that would be fixed by this change
@alamb, @nevi-me and @jhorstmann, thanks a lot for taking a close look into this, also for the amazing work yesterday and today on reviewing and merging stuff. Really impressive! 💯 I rebased this and resolved conflicts.
I'm super-excited, because this likely unlocks us on the failures that I've been getting on #8200.
This is a major refactor of the `equal.rs` module.

The rationale for this change is manifold:

- [...] `sort`, `take` and `concatenate` kernel's tests, and some of the tests of the builders.
- [...] `unsafe` APIs that we have (via pointer arithmetic), which makes it risky to operate and mutate.

This PR:

- [...] `impl PartialEq for dyn Array`, to allow `Array` comparison based on `Array::data` (main change)
- [...] `ArrayData`, i.e. it no longer depends on concrete array types (such as `PrimitiveArray` and related API) to perform comparisons.
- [...] `range` comparison
- [...] `match datatype`.
- [...] `equal.rs` in smaller, more manageable files.
- [...] `ArrayListOps`, since it is no longer needed

Note that this does not implement `PartialEq` for `ArrayData`, only `dyn Array`, as different data does not imply a different array (due to nullability). That implementation is being worked on in #8200.

IMO this PR significantly simplifies the code around array comparison, to the point where many implementations are 5 lines long.
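The shape of the main change can be sketched roughly as follows. Everything here is a simplified, hypothetical model: `DataType`, `ArrayData`, `Array`, `Int32Array` and `equal` are stand-ins with far fewer fields than the real arrow types, and storing one `i32` per slot is an assumption made only to keep the example short.

```rust
/// Hypothetical, heavily simplified data type tag.
#[derive(Clone, Copy, PartialEq)]
enum DataType {
    Int32,
    Boolean,
}

/// Hypothetical stand-in for the type-erased array data.
struct ArrayData {
    data_type: DataType,
    /// Per-slot validity (true = valid).
    validity: Vec<bool>,
    /// One `i32` per slot for simplicity; `Boolean` treats non-zero as true.
    values: Vec<i32>,
}

/// Hypothetical stand-in for the `Array` trait: all we need is `data()`.
trait Array {
    fn data(&self) -> &ArrayData;
}

/// Logical equality on the data alone, dispatching on the data type.
fn equal(lhs: &ArrayData, rhs: &ArrayData) -> bool {
    if lhs.data_type != rhs.data_type || lhs.validity.len() != rhs.validity.len() {
        return false;
    }
    (0..lhs.validity.len()).all(|i| match (lhs.validity[i], rhs.validity[i]) {
        // Values under a null slot are ignored, so different data
        // does not necessarily mean a different array.
        (false, false) => true,
        (true, true) => match lhs.data_type {
            DataType::Int32 => lhs.values[i] == rhs.values[i],
            DataType::Boolean => (lhs.values[i] != 0) == (rhs.values[i] != 0),
        },
        _ => false,
    })
}

/// Equality for trait objects, with no downcasting to concrete array types.
impl<'a> PartialEq for dyn Array + 'a {
    fn eq(&self, other: &Self) -> bool {
        equal(self.data(), other.data())
    }
}

/// A concrete array type that only wraps its data.
struct Int32Array {
    data: ArrayData,
}

impl Array for Int32Array {
    fn data(&self) -> &ArrayData {
        &self.data
    }
}
```

With this in place, two `&dyn Array` values can be compared with `==`, and arrays that differ only in the values stored beneath null slots compare as equal.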
This also improves performance by 10-40%.
Benchmark results
All tests are there, plus new tests for some of the edge cases and untested arrays.
This change is backward incompatible
`array1.equals(&array2)` no longer works: use `array1 == array2` instead, which is the idiomatic way of comparing structs and trait objects in Rust.