-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-10042: [Rust] Fix tests involving ArrayData/Buffer equality #8590
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Could you describe where do we need to compare buffers? I am asking because I have worked on this problem before, and in my last iteration of this, I abandoned the idea of comparing buffers: a My current hypothesis is that equality should be done via |
|
I can't find the reference now but this was discussed some time ago on a PR. The thinking for taking |
|
Here is the PR: #4331 See the comment about half way down (I'm can't seem to link to it directly) |
|
If I remove the
I'm not sure enough of those cases whether they do expect the |
|
@carols10cents , I do not think that we should remove |
I was not proposing that we remove |
|
Additional data to consider: I'm not confident the implementations are exactly comparable, but the C++ Buffer type does not take its |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if rather than changing the meaning of what "equal" buffers means in the Arrow library, we could instead update this test. Specifically, rather than checking for buffer equality via
assert_eq!(expected_data.buffers(), actual_data.buffers());
This test could make a comparison specific to the arrow round trip.
For example:
assert!(compare_buffers(expected_data.buffers(), actual_data.buffers()).unwrap());With
// Use a specialized comparison here to check the data of the
// buffers, ignoring offsets
fn compare_buffers(b1: &[Buffer], b2: &[Buffer]) -> std::result::Result<bool, String> {
// Compare two lists of buffers, ignorning offset
if b1.len() != b2.len() {
return Err(format!("Length {} doesn't match length {}", b1.len(), b2.len()));
}
for (idx, (buffer1, buffer2)) in b1.iter().zip(b2.iter()).enumerate() {
if buffer1.data() != buffer2.data() {
return Err(format!("Buffer data mismatch at index {}", idx));
}
}
Ok(true)
}I would ideally suggest using the eq kernel, but that appears to only be implemented for primitive arrays, so the following doesn't work 😢
let expected_col = expected_batch.column(i);
let actual_col = actual_batch.column(i);
let compare = arrow::compute::eq(expected_col, actual_col).expect("Comparison");
for row in 0..compare.len() {
assert!(compare.value(row), "Mismatch in batch {} @ row {}", i, row);
}6c82060 to
4d57eb2
Compare
|
@jorgecarleitao @nevi-me @alamb I've just force pushed an alternative solution to this problem that doesn't change I've updated the PR title and description. I'm still associating this to the same JIRA ticket as I think this change will "resolve" the question posed in the ticket, but it no longer "fixes" the problem as proposed there. |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Look good to me
rust/arrow/src/array/data.rs
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 for comments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm pretty sure those are @nevi-me 's, I just kept them in there :)
rust/arrow/src/array/data.rs
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the offset does not have to be equal. Consider a buffer [1,2,3,4,1,2,3,4] and two arrays referencing this buffer, one with offset 0 and len 4, the other with offset 4 and len 4. Both arrays should be considered to contain the same 4 elements. This also means the compare_buffers function needs to take the array length as additional parameter.
rust/arrow/src/array/data.rs
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
compare_buffers_regions also needs an additional parameter for the array length to be correct since the array could be a slice of some bigger underlying buffers.
I missed the PR that introduced compare_buffers_regions, the way it is written it will only work correctly for bitpacked buffers. That is correct for this call, but I found another usage where it is used for other buffers. It uses the buffer.bit_slice method which is documented to take an offset in bits.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at that function again, compare_buffers_regions and alsoPartialEq for ArrayData do assert_eq! instead of returning true/false, that is probably also not intended.
Update: Ok, I see the return type is fixed in this PR.
jorgecarleitao
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that the changes in this PR need motivation on why do we should continue to pursue having two different implementations of array comparison:
impl PartialEq for ArrayData- the
ArrayEqualtrait
IMO there should be one, and only one, implementation of array comparison. Currently we have two, and this PR doubles down on this path. I am not convinced we should continue on this path because it lead us to the path where we will disagree on whether two arrays are equal or not depending on which implementation we use to equate them.
IMO, the root cause of the problem that this PR is trying to solve is exactly the one I just described: we have two implementations of array equality and the tests are failing when one of then is used, even though what is being tested is correct (it is the equality implementation that is wrong). I bet that if we would use make_array and downcast each array to its correct type, the tests would pass.
It was fine to hack PartialEq on the parquet writer branch, because unfortunately we have no way of comparing two ArrayData, but IMO we should now make an effort to ensure that there is one way of comparing two arrays.
As @jhorstmann commented, array comparison is not straightforward as we can't just equate offsets (and many other things). The equality calculation implicitly described in the specification is actually non-trivial, specially for arrays with offset buffers, structs, dictionaries, etc.
#8541 addresses the root cause of this by refactoring the Array equality to depend on a single function that compares two ArrayData, without using the ArrayEqual trait. The motivation for this is exactly what we are trying to solve here: we need to be able to equate two ArrayData because often we operate without the Array trait present (i.e. directly using ArrayData), and the Array equality should derive directly from the equality of ArrayData.
One idea:
- Make the array comparison on the parquet tests be done using the
PartialEq, like this PR is doing - Make
PartialEquse the functionequalbeing introduced on #8541 - Remove both
#[ignore]
I played with this idea here: https://github.com/jorgecarleitao/arrow/pull/20 and it does seem to work: the tests pass with these changes, including the un-ignored ones.
|
Also, I very much agree with @carols10cents that we need a methodology to compare arrays with granularity :) |
I also like this approach, as #8541 now accounts for logical equality |
|
Hi @carols10cents, with the IPC integration and logical equality PRs merged, this should be ready (the Parquet changes). |
4d57eb2 to
32c7c38
Compare
|
@nevi-me done! The only changes left here are in tests :) Thank you all for the work related to this! |
When compared using the PartialEq implementation, two Buffers that only differ in their capacity are considered unequal. This comparison is relevant when testing implementation details of Buffer and BufferData. When comparing Buffers that are part of ArrayData, the Buffer capacity should NOT be taken into account: the logical equality of the data is computed differently.
32c7c38 to
159648e
Compare
|
@nevi-me and @jorgecarleitao I think we should merge this PR -- is there anything else that needs to be done before we do so? |
|
(though the title may need to be updated) |
|
(updated) |
You may go ahead and merge it :) I was waiting for integration tests to complete, but I'm about to head to sleep now |

The tests added, unignored, and modified in this PR should illustrate the behavior I'm going for.
This gets 2 ignored Parquet -> Arrow roundtrip tests to pass because they were only failing due to differences in their
ArrayData'sBufferDatacapacities.This leaves
BufferData'sPartialEqimplementation as-is (that is, taking the equality ofcapacityinto account) and does not modify any of the tests in rust/arrow/src/array/{array, builder, union}.rs or in rust/arrow/src/compute/{kernels/cast, kernels/filter, util}.rs that are testing the internal implementation ofBuffer.This does change
PartialEqforArrayDatato not assert to fix the problems noted by @jorgecarleitao .This has the downside of making test failures where
ArrayDatas should be equal but aren't harder to interpret because theassert_eq!s aren't as granular. We can't usedebug_assert_eq!either, because then we can't write tests that doassert_ne!(array_data1, array_data2)(I have included an example of such a test). I think to solve this downside, we should add a function only used in tests that does the more granular assertions. I am leaving that to a future PR.