ARROW-10042: [Rust] Fix tests involving ArrayData/Buffer equality #8590

carols10cents · 2020-11-04T15:39:34Z

The tests added, unignored, and modified in this PR should illustrate the behavior I'm going for.

This gets 2 ignored Parquet -> Arrow roundtrip tests to pass because they were only failing due to differences in their ArrayData's BufferData capacities.

This leaves BufferData's PartialEq implementation as-is (that is, taking the equality of capacity into account) and does not modify any of the tests in rust/arrow/src/array/{array, builder, union}.rs or in rust/arrow/src/compute/{kernels/cast, kernels/filter, util}.rs that are testing the internal implementation of Buffer.

This does change PartialEq for ArrayData to not assert to fix the problems noted by @jorgecarleitao .

This has the downside of making test failures where ArrayDatas should be equal but aren't harder to interpret because the assert_eq!s aren't as granular. We can't use debug_assert_eq! either, because then we can't write tests that do assert_ne!(array_data1, array_data2) (I have included an example of such a test). I think to solve this downside, we should add a function only used in tests that does the more granular assertions. I am leaving that to a future PR.

github-actions · 2020-11-04T15:46:30Z

https://issues.apache.org/jira/browse/ARROW-10042

jorgecarleitao · 2020-11-04T16:00:31Z

Could you describe where do we need to compare buffers?

I am asking because I have worked on this problem before, and in my last iteration of this, I abandoned the idea of comparing buffers: a Buffer is purely a physical representation of data without any logical interpretation: two buffers from two different arrays can be equal and the arrays still be different (e.g. due to the nullability or offset), and the opposite is also true: two arrays can be equal but have different buffers (e.g. if the child data is different, or if the datatype is different).

My current hypothesis is that equality should be done via ArrayData, which contains all the relevant data and logical representation of that data -- DataType, to make it possible to logically compare two arrays. I fielded #8541 with that idea, which IMO will also help the parquet work. I think that @nevi-me also had some ideas in mind.

paddyhoran · 2020-11-04T18:13:14Z

I can't find the reference now but this was discussed some time ago on a PR. The thinking for taking capacity into account when comparing Buffer's was that array equality, etc. would be done at a higher level as @jorgecarleitao described. I think that @sunchao, in particular, favored taking capacity into account when comparing Buffer's. i.e. this is a more low-level equality check.

paddyhoran · 2020-11-04T18:17:18Z

Here is the PR: #4331

See the comment about half way down (I'm can't seem to link to it directly)

carols10cents · 2020-11-04T18:36:45Z

If I remove the impl PartialEq for BufferData entirely, compiling then complains that the #[derive(PartialEq)] on Buffer can't work. If I remove the derived PartialEq on Buffer, I then get compilation errors in places such as:

arrow/rust/arrow/src/array/data.rs

Line 264 in 913cd76

assert_eq!(s_sliced, o_sliced);

the new-ish compare_buffer_regions function that I think is relevant to this discussion
arrow/rust/arrow/src/array/array.rs

Line 2429 in 5355f7c

assert_eq!(buf2, arr.values());

and a bunch of other tests in that file
arrow/rust/arrow/src/array/builder.rs

Line 2705 in 913cd76

assert_eq!(buf, arr.values());

and a bunch of other tests in that file

arrow/rust/arrow/src/array/union.rs

Lines 656 to 659 in 913cd76

    
           assert_eq!( 
        
               union.data().buffers()[0], 
        
               Buffer::from(&expected_type_ids.to_byte_slice()) 
        
           );

and a bunch of other tests in that file

A bunch of tests in arrow/src/buffer.rs of course

arrow/rust/arrow/src/compute/kernels/cast.rs

Lines 1365 to 1368 in 913cd76

    
           assert_eq!( 
        
               list_array.data().buffers().to_vec(), 
        
               cast_array.data().buffers().to_vec() 
        
           );

arrow/rust/arrow/src/compute/kernels/filter.rs

Lines 1117 to 1120 in 913cd76

    
           assert_eq!( 
        
               Buffer::from(&[3, 4, 5].to_byte_slice()), 
        
               d.values().data().buffers()[0].clone() 
        
           );

and a bunch of other tests in that file

arrow/rust/arrow/src/compute/util.rs

Lines 241 to 244 in 913cd76

    
           assert_eq!( 
        
               Some(Buffer::from([0b01001010])), 
        
               combine_option_bitmap(&some_bitmap, &none_bitmap, 8).unwrap() 
        
           );

and a bunch of other tests in that file

I'm not sure enough of those cases whether they do expect the capacity to be equal and mean this PartialEq should stay as-is, or if these cases are being tested at too low a level and should be checking ArrayData instead... please let me know what you all think!

jorgecarleitao · 2020-11-04T19:49:10Z

@carols10cents , I do not think that we should remove impl PartialEq for BufferData. I am only trying to understand why would we would like to compare two Buffer and not take the ArrayData::offset into account in the equality. Without the offset, comparing buffers is basically comparing physical memory, and IMO has limited use-cases.

carols10cents · 2020-11-04T19:54:04Z

I do not think that we should remove impl PartialEq for BufferData. I am only trying to understand why would we would like to compare two Buffer and not take the ArrayData::offset into account in the equality. Without the offset, comparing buffers is basically comparing physical memory, and IMO has limited use-cases.

I was not proposing that we remove impl PartialEq for BufferData, I was only doing that to get the compiler to tell me about all the places we're using it.

carols10cents · 2020-11-04T21:54:02Z

Additional data to consider: I'm not confident the implementations are exactly comparable, but the C++ Buffer type does not take its capacity into account in its Equals implementations.

alamb

I wonder if rather than changing the meaning of what "equal" buffers means in the Arrow library, we could instead update this test. Specifically, rather than checking for buffer equality via

            assert_eq!(expected_data.buffers(), actual_data.buffers());

at https://github.com/apache/arrow/pull/8590/files#diff-e037401ca0787613b6d0b7ea98a14374ab37f2addfd079475fc1c33bba2e252cR922

This test could make a comparison specific to the arrow round trip.

For example:

            assert!(compare_buffers(expected_data.buffers(), actual_data.buffers()).unwrap());

With

    // Use a specialized comparison here to check the data of the
    // buffers, ignoring offsets
    fn compare_buffers(b1: &[Buffer], b2: &[Buffer]) -> std::result::Result<bool, String> {
        // Compare two lists of buffers, ignorning offset
        if b1.len() != b2.len() {
            return Err(format!("Length {} doesn't match length {}", b1.len(), b2.len()));
        }

        for (idx, (buffer1, buffer2)) in b1.iter().zip(b2.iter()).enumerate() {
            if buffer1.data() != buffer2.data() {
                return Err(format!("Buffer data mismatch at index {}", idx));
            }
        }

        Ok(true)
    }

I would ideally suggest using the eq kernel, but that appears to only be implemented for primitive arrays, so the following doesn't work 😢

            let expected_col = expected_batch.column(i);
            let actual_col = actual_batch.column(i);
            let compare = arrow::compute::eq(expected_col, actual_col).expect("Comparison");
            for row in 0..compare.len() {
                assert!(compare.value(row), "Mismatch in batch {} @ row {}", i, row);
            }

carols10cents · 2020-11-06T19:03:00Z

@jorgecarleitao @nevi-me @alamb I've just force pushed an alternative solution to this problem that doesn't change BufferData's PartialEq implementation, and I've added more tests to demonstrate the problems I'm trying to solve. I would appreciate another look and to know what you think!

I've updated the PR title and description. I'm still associating this to the same JIRA ticket as I think this change will "resolve" the question posed in the ticket, but it no longer "fixes" the problem as proposed there.

alamb

Look good to me

alamb · 2020-11-06T19:08:28Z

rust/arrow/src/array/data.rs

👍 for comments

I'm pretty sure those are @nevi-me 's, I just kept them in there :)

jhorstmann · 2020-11-06T21:22:57Z

rust/arrow/src/array/data.rs

I think the offset does not have to be equal. Consider a buffer [1,2,3,4,1,2,3,4] and two arrays referencing this buffer, one with offset 0 and len 4, the other with offset 4 and len 4. Both arrays should be considered to contain the same 4 elements. This also means the compare_buffers function needs to take the array length as additional parameter.

jhorstmann · 2020-11-06T21:28:31Z

rust/arrow/src/array/data.rs

compare_buffers_regions also needs an additional parameter for the array length to be correct since the array could be a slice of some bigger underlying buffers.

I missed the PR that introduced compare_buffers_regions, the way it is written it will only work correctly for bitpacked buffers. That is correct for this call, but I found another usage where it is used for other buffers. It uses the buffer.bit_slice method which is documented to take an offset in bits.

Looking at that function again, ~~compare_buffers_regions and also~~ PartialEq for ArrayData do assert_eq! instead of returning true/false, that is probably also not intended.

Update: Ok, I see the return type is fixed in this PR.

jorgecarleitao

I think that the changes in this PR need motivation on why do we should continue to pursue having two different implementations of array comparison:

impl PartialEq for ArrayData
the ArrayEqual trait

IMO there should be one, and only one, implementation of array comparison. Currently we have two, and this PR doubles down on this path. I am not convinced we should continue on this path because it lead us to the path where we will disagree on whether two arrays are equal or not depending on which implementation we use to equate them.

IMO, the root cause of the problem that this PR is trying to solve is exactly the one I just described: we have two implementations of array equality and the tests are failing when one of then is used, even though what is being tested is correct (it is the equality implementation that is wrong). I bet that if we would use make_array and downcast each array to its correct type, the tests would pass.

It was fine to hack PartialEq on the parquet writer branch, because unfortunately we have no way of comparing two ArrayData, but IMO we should now make an effort to ensure that there is one way of comparing two arrays.

As @jhorstmann commented, array comparison is not straightforward as we can't just equate offsets (and many other things). The equality calculation implicitly described in the specification is actually non-trivial, specially for arrays with offset buffers, structs, dictionaries, etc.

#8541 addresses the root cause of this by refactoring the Array equality to depend on a single function that compares two ArrayData, without using the ArrayEqual trait. The motivation for this is exactly what we are trying to solve here: we need to be able to equate two ArrayData because often we operate without the Array trait present (i.e. directly using ArrayData), and the Array equality should derive directly from the equality of ArrayData.

One idea:

Make the array comparison on the parquet tests be done using the PartialEq, like this PR is doing
Make PartialEq use the function equal being introduced on #8541
Remove both #[ignore]

I played with this idea here: https://github.com/jorgecarleitao/arrow/pull/20 and it does seem to work: the tests pass with these changes, including the un-ignored ones.

jorgecarleitao · 2020-11-07T06:26:29Z

Also, I very much agree with @carols10cents that we need a methodology to compare arrays with granularity :)

nevi-me · 2020-11-07T10:41:10Z

One idea:

Make the array comparison on the parquet tests be done using the PartialEq, like this PR is doing

Make PartialEq use the function equal being introduced on ARROW-10402: [Rust] Refactor array equality #8541

Remove both #[ignore]

I also like this approach, as #8541 now accounts for logical equality

nevi-me · 2020-11-10T21:28:27Z

Hi @carols10cents, with the IPC integration and logical equality PRs merged, this should be ready (the Parquet changes).
We can merge it in after you rebase :)

carols10cents · 2020-11-10T22:00:35Z

@nevi-me done! The only changes left here are in tests :) Thank you all for the work related to this!

When compared using the PartialEq implementation, two Buffers that only differ in their capacity are considered unequal. This comparison is relevant when testing implementation details of Buffer and BufferData. When comparing Buffers that are part of ArrayData, the Buffer capacity should NOT be taken into account: the logical equality of the data is computed differently.

alamb · 2020-11-10T22:45:06Z

@nevi-me and @jorgecarleitao I think we should merge this PR -- is there anything else that needs to be done before we do so?

alamb · 2020-11-10T22:46:39Z

(though the title may need to be updated)

carols10cents · 2020-11-10T22:48:21Z

(updated)

nevi-me · 2020-11-10T22:50:22Z

@nevi-me and @jorgecarleitao I think we should merge this PR -- is there anything else that needs to be done before we do so?

You may go ahead and merge it :) I was waiting for integration tests to complete, but I'm about to head to sleep now

alamb · 2020-11-10T22:53:23Z

The tests have passed:

I'll merge it

github-actions bot added the Component: Rust label Nov 4, 2020

alamb reviewed Nov 6, 2020

View reviewed changes

carols10cents force-pushed the bufferdata-eq branch from 6c82060 to 4d57eb2 Compare November 6, 2020 18:48

carols10cents changed the title ~~ARROW-10042: [Rust] Only compare BufferData's data for equality~~ ARROW-10042: [Rust] BufferData's equality should depend on its capacity; ArrayData's equality should not depend on its BufferDatas' capacities' equality Nov 6, 2020

alamb approved these changes Nov 6, 2020

View reviewed changes

jhorstmann reviewed Nov 6, 2020

View reviewed changes

nevi-me self-requested a review November 7, 2020 05:08

jorgecarleitao reviewed Nov 7, 2020

View reviewed changes

andygrove added the needs-rebase A PR that needs to be rebased by the author label Nov 10, 2020

carols10cents force-pushed the bufferdata-eq branch from 4d57eb2 to 32c7c38 Compare November 10, 2020 22:00

nevi-me approved these changes Nov 10, 2020

View reviewed changes

nevi-me removed the needs-rebase A PR that needs to be rebased by the author label Nov 10, 2020

carols10cents force-pushed the bufferdata-eq branch from 32c7c38 to 159648e Compare November 10, 2020 22:15

carols10cents changed the title ~~ARROW-10042: [Rust] BufferData's equality should depend on its capacity; ArrayData's equality should not depend on its BufferDatas' capacities' equality~~ ARROW-10042: [Rust] Fix tests involving ArrayData/Buffer equality Nov 10, 2020

alamb closed this in 14501b2 Nov 10, 2020

asfimport mentioned this pull request Nov 10, 2020

[Rust] Buffer equalities may be incorrect #26064

Closed

ARROW-10042: [Rust] Fix tests involving ArrayData/Buffer equality #8590

ARROW-10042: [Rust] Fix tests involving ArrayData/Buffer equality #8590

Uh oh!

Conversation

carols10cents commented Nov 4, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions bot commented Nov 4, 2020

Uh oh!

jorgecarleitao commented Nov 4, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

paddyhoran commented Nov 4, 2020

Uh oh!

paddyhoran commented Nov 4, 2020

Uh oh!

carols10cents commented Nov 4, 2020

Uh oh!

jorgecarleitao commented Nov 4, 2020

Uh oh!

carols10cents commented Nov 4, 2020

Uh oh!

carols10cents commented Nov 4, 2020

Uh oh!

alamb left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

carols10cents commented Nov 6, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

alamb left a comment

Choose a reason for hiding this comment

Uh oh!

alamb Nov 6, 2020

Choose a reason for hiding this comment

Uh oh!

carols10cents Nov 6, 2020

Choose a reason for hiding this comment

Uh oh!

jhorstmann Nov 6, 2020

Choose a reason for hiding this comment

Uh oh!

jhorstmann Nov 6, 2020

Choose a reason for hiding this comment

Uh oh!

jhorstmann Nov 6, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jorgecarleitao left a comment

Choose a reason for hiding this comment

Uh oh!

jorgecarleitao commented Nov 7, 2020

Uh oh!

nevi-me commented Nov 7, 2020

Uh oh!

nevi-me commented Nov 10, 2020

Uh oh!

carols10cents commented Nov 10, 2020

Uh oh!

alamb commented Nov 10, 2020

Uh oh!

alamb commented Nov 10, 2020

Uh oh!

carols10cents commented Nov 10, 2020

Uh oh!

nevi-me commented Nov 10, 2020

Uh oh!

alamb commented Nov 10, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

7 participants

carols10cents commented Nov 4, 2020 •

edited

Loading

jorgecarleitao commented Nov 4, 2020 •

edited

Loading

alamb left a comment •

edited

Loading

carols10cents commented Nov 6, 2020 •

edited

Loading

jhorstmann Nov 6, 2020 •

edited

Loading