
Conversation


@alamb alamb commented Nov 16, 2025

…thods, deprecate old methods

Which issue does this PR close?

Rationale for this change

  1. bitwise_bin_op_helper and bitwise_unary_op_helper are somewhat hard to find and use,
    as explained in WIP: special case bitwise ops when buffers are u64 aligned #8807

  2. I want to optimize bitwise operations even more heavily (see WIP: special case bitwise ops when buffers are u64 aligned #8807), so I want the implementations centralized so I can focus the optimization effort there

Also, I think these APIs cover the use case explained by @jorstmann in #8561:

Building a new buffer by starting from an empty state and incrementally appending new bits (append_value, append_slice, append_packed_range and similar methods).

By creating a method on Buffer directly, it is easier to find, and it is clearer that
a new Buffer is being created.

What changes are included in this PR?

Changes:

  1. Add Buffer::from_bitwise_unary and Buffer::from_bitwise_binary methods that do the same thing as bitwise_unary_op_helper and bitwise_bin_op_helper but are easier to find and use
  2. Deprecate bitwise_unary_op_helper and bitwise_bin_op_helper in favor
    of the new Buffer methods
  3. Document the new methods, with examples (specifically that the bitwise operations
    operate on bits, not bytes, and should not perform any cross-byte operations)

Are these changes tested?

Yes, new doc tests

Are there any user-facing changes?

New APIs, some deprecated

@github-actions github-actions bot added the arrow Changes to the arrow crate label Nov 16, 2025
@alamb alamb force-pushed the alamb/bitwise_ops branch from 3c68505 to 69e68a1 Compare November 16, 2025 14:02
@alamb alamb force-pushed the alamb/bitwise_ops branch from 69e68a1 to d5a3604 Compare November 16, 2025 14:04

alamb commented Nov 16, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/bitwise_ops (d5a3604) to ca4a0ae diff
BENCH_NAME=boolean_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench boolean_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=alamb_bitwise_ops
Results will be posted here when complete


alamb commented Nov 16, 2025

🤖: Benchmark completed

Details

group         alamb_bitwise_ops                      main
-----         -----------------                      ----
and           1.00    272.6±1.27ns        ? ?/sec    1.00    272.7±0.86ns        ? ?/sec
and_sliced    1.00   1096.3±7.89ns        ? ?/sec    1.00   1094.7±3.34ns        ? ?/sec
not           1.00    213.1±0.25ns        ? ?/sec    1.00    214.2±1.06ns        ? ?/sec
not_sliced    1.01    965.5±1.32ns        ? ?/sec    1.00    960.6±3.89ns        ? ?/sec
or            1.01    255.1±0.63ns        ? ?/sec    1.00    253.8±1.86ns        ? ?/sec
or_sliced     1.00   1228.0±7.56ns        ? ?/sec    1.00  1227.8±18.85ns        ? ?/sec


alamb commented Nov 16, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/bitwise_ops (d5a3604) to ca4a0ae diff
BENCH_NAME=buffer_bit_ops
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench buffer_bit_ops
BENCH_FILTER=
BENCH_BRANCH_NAME=alamb_bitwise_ops
Results will be posted here when complete


alamb commented Nov 16, 2025

🤖: Benchmark completed

Details

group                                alamb_bitwise_ops                      main
-----                                -----------------                      ----
buffer_binary_ops/and                1.00    259.6±0.56ns    55.1 GB/sec    1.00    258.9±2.00ns    55.2 GB/sec
buffer_binary_ops/and_with_offset    1.12   1486.1±2.12ns     9.6 GB/sec    1.00   1322.8±9.40ns    10.8 GB/sec
buffer_binary_ops/or                 1.00    239.3±0.60ns    59.8 GB/sec    1.07    256.3±1.96ns    55.8 GB/sec
buffer_binary_ops/or_with_offset     1.00   1355.4±2.50ns    10.6 GB/sec    1.10  1484.8±14.40ns     9.6 GB/sec
buffer_unary_ops/not                 1.14    257.5±0.71ns    37.0 GB/sec    1.00    225.9±3.19ns    42.2 GB/sec
buffer_unary_ops/not_with_offset     1.00    868.1±2.51ns    11.0 GB/sec    1.34  1160.1±14.15ns     8.2 GB/sec


alamb commented Nov 16, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/bitwise_ops (d5a3604) to ca4a0ae diff
BENCH_NAME=boolean_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench boolean_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=alamb_bitwise_ops
Results will be posted here when complete


alamb commented Nov 16, 2025

🤖: Benchmark completed

Details

group         alamb_bitwise_ops                      main
-----         -----------------                      ----
and           1.00    272.4±1.45ns        ? ?/sec    1.00    273.1±1.36ns        ? ?/sec
and_sliced    1.00   1096.0±1.60ns        ? ?/sec    1.00   1095.1±2.77ns        ? ?/sec
not           1.00    213.8±0.29ns        ? ?/sec    1.00    214.0±0.40ns        ? ?/sec
not_sliced    1.00    965.6±9.77ns        ? ?/sec    1.00    961.8±5.75ns        ? ?/sec
or            1.00    254.1±0.66ns        ? ?/sec    1.01    255.6±0.41ns        ? ?/sec
or_sliced     1.00   1225.5±2.12ns        ? ?/sec    1.00   1226.9±7.43ns        ? ?/sec


alamb commented Nov 16, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/bitwise_ops (d5a3604) to ca4a0ae diff
BENCH_NAME=buffer_bit_ops
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench buffer_bit_ops
BENCH_FILTER=
BENCH_BRANCH_NAME=alamb_bitwise_ops
Results will be posted here when complete


alamb commented Nov 16, 2025

🤖: Benchmark completed

Details

group                                alamb_bitwise_ops                      main
-----                                -----------------                      ----
buffer_binary_ops/and                1.00    259.7±0.55ns    55.1 GB/sec    1.00    259.3±4.36ns    55.2 GB/sec
buffer_binary_ops/and_with_offset    1.13   1486.2±3.20ns     9.6 GB/sec    1.00   1320.5±3.78ns    10.8 GB/sec
buffer_binary_ops/or                 1.00    239.2±0.34ns    59.8 GB/sec    1.07    256.2±0.89ns    55.8 GB/sec
buffer_binary_ops/or_with_offset     1.00   1355.8±4.32ns    10.6 GB/sec    1.09   1483.7±4.32ns     9.6 GB/sec
buffer_unary_ops/not                 1.13    257.1±0.97ns    37.1 GB/sec    1.00    226.6±1.72ns    42.1 GB/sec
buffer_unary_ops/not_with_offset     1.00    863.6±3.06ns    11.0 GB/sec    1.32   1139.4±2.91ns     8.4 GB/sec


alamb commented Nov 18, 2025

The benchmarks show a slowdown for some operations for some reason

buffer_binary_ops/and_with_offset 1.13 1486.2±3.20ns 9.6 GB/sec 1.00 1320.5±3.78ns 10.8 GB/sec

However, given the duration of the benchmark, I am thinking maybe this is cache lines or something.

I have an idea of how to improve the benchmarks so they are less noisy (basically run them in a 100x loop)

let rem = op(left_chunks.remainder_bits(), right_chunks.remainder_bits());
// bits are counted starting from the least significant bit, so to_le_bytes should be correct
let rem = &rem.to_le_bytes()[0..remainder_bytes];
buffer.extend_from_slice(rem);
Contributor:

This might do an extra allocation? Other places avoid this by preallocating the final u64 needed for the remainder as well (collect_bool)

@alamb alamb Nov 19, 2025

That is a good call -- I will make the change

However, this is the same code as the current bitwise_binary_op uses, so I would expect no performance difference 🤔

https://github.com/apache/arrow-rs/pull/8854/files#diff-e7a951ab8abfeef1016ed4427a3aef25be5be470454caa1e1dd93e56968316b5L122

Contributor:

I agree; however, allocations during benchmarking seem to make the results very noisy.

Contributor Author:

🤔 I tried this

    pub fn from_bitwise_binary_op<F>(
        left: impl AsRef<[u8]>,
        left_offset_in_bits: usize,
        right: impl AsRef<[u8]>,
        right_offset_in_bits: usize,
        len_in_bits: usize,
        mut op: F,
    ) -> Buffer
    where
        F: FnMut(u64, u64) -> u64,
    {
        let left_chunks = BitChunks::new(left.as_ref(), left_offset_in_bits, len_in_bits);
        let right_chunks = BitChunks::new(right.as_ref(), right_offset_in_bits, len_in_bits);

        let remainder_bytes = ceil(left_chunks.remainder_len(), 8);
        // if it evenly divides into u64 chunks
        let buffer = if remainder_bytes == 0 {
            let chunks = left_chunks
                .iter()
                .zip(right_chunks.iter())
                .map(|(left, right)| op(left, right));
            // Soundness: `BitChunks` is a trusted-len iterator that
            // correctly reports its upper bound
            unsafe { MutableBuffer::from_trusted_len_iter(chunks) }
        } else {
            // Compute last u64 here so that we can reserve exact capacity
            let rem = op(left_chunks.remainder_bits(), right_chunks.remainder_bits());

            let chunks = left_chunks
                .iter()
                .zip(right_chunks.iter())
                .map(|(left, right)| op(left, right))
                .chain(std::iter::once(rem));
            // Soundness: `BitChunks` is a trusted-len iterator that
            // correctly reports its upper bound, and so is the `chain` iterator
            let mut buffer = unsafe { MutableBuffer::from_trusted_len_iter(chunks) };
            // Adjust the length down if last u64 is not fully used
            let extra_bytes = 8 - remainder_bytes;
            buffer.truncate(buffer.len() - extra_bytes);
            buffer
        };
        buffer.into()
    }

But it seems to be slower.

Contributor Author:

I also tried making a version of MutableBuffer::from_trusted_len_iter that also reserves additional capacity, and it didn't seem to help either (perhaps because the benchmarks happen to avoid reallocation 🤔)

    /// Like [`from_trusted_len_iter`] but can add additional capacity at the end
    /// in case the caller wants to add more data after the initial iterator.
    #[inline]
    pub unsafe fn from_trusted_len_iter_with_additional_capacity<T: ArrowNativeType, I: Iterator<Item = T>>(
        iterator: I,
        additional_capacity: usize,
    ) -> Self {
        let item_size = std::mem::size_of::<T>();
        let (_, upper) = iterator.size_hint();
        let upper = upper.expect("from_trusted_len_iter requires an upper limit");
        let len = upper * item_size;

        let mut buffer = MutableBuffer::new(len + additional_capacity);

        let mut dst = buffer.data.as_ptr();
        for item in iterator {
            // note how there is no reserve here (compared with `extend_from_iter`)
            let src = item.to_byte_slice().as_ptr();
            unsafe { std::ptr::copy_nonoverlapping(src, dst, item_size) };
            dst = unsafe { dst.add(item_size) };
        }
        assert_eq!(
            unsafe { dst.offset_from(buffer.data.as_ptr()) } as usize,
            len,
            "Trusted iterator length was not accurately reported"
        );
        buffer.len = len;
        buffer
    }

@Dandandan Dandandan Nov 19, 2025

There is also an extend_from_trusted_len_iter in MutableBuffer? Another option is to use Vec::extend here as well.

F: FnMut(u64) -> u64,
{
// reserve capacity and set length so we can get a typed view of u64 chunks
let mut result =
Contributor:

As we overwrite the results, we shouldn't need to initialize/zero out the array.

@Dandandan

The benchmarks show a slowdown for some operations for some reason

buffer_binary_ops/and_with_offset 1.13 1486.2±3.20ns 9.6 GB/sec 1.00 1320.5±3.78ns 10.8 GB/sec

However, given the duration of the benchmark, I am thinking maybe this is cache lines or something.

I have an idea of how to improve the benchmarks so they are less noisy (basically run them in a 100x loop)

Might it also be because of the allocation? It looks like and_with_offset and and are not over power-of-two inputs.


alamb commented Dec 13, 2025

run benchmark buffer_bit_ops boolean_kernels

@alamb-ghbot

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/bitwise_ops (819210e) to c6cc7f8 diff
BENCH_NAME=buffer_bit_ops
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench buffer_bit_ops
BENCH_FILTER=
BENCH_BRANCH_NAME=alamb_bitwise_ops
Results will be posted here when complete

@alamb-ghbot

🤖: Benchmark completed

Details

group                                alamb_bitwise_ops                      main
-----                                -----------------                      ----
buffer_binary_ops/and                3.73   977.9±12.68ns    14.6 GB/sec    1.00    262.0±2.96ns    54.6 GB/sec
buffer_binary_ops/and_with_offset    1.20   1795.3±4.61ns     8.0 GB/sec    1.00  1493.0±24.14ns     9.6 GB/sec
buffer_binary_ops/or                 3.84    977.7±7.18ns    14.6 GB/sec    1.00    254.6±0.80ns    56.2 GB/sec
buffer_binary_ops/or_with_offset     1.39  1838.2±48.86ns     7.8 GB/sec    1.00   1324.5±6.34ns    10.8 GB/sec
buffer_unary_ops/not                 2.76    625.5±8.53ns    15.2 GB/sec    1.00    226.8±3.71ns    42.1 GB/sec
buffer_unary_ops/not_with_offset     1.12    927.0±4.27ns    10.3 GB/sec    1.00    831.2±1.37ns    11.5 GB/sec

@alamb-ghbot

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/bitwise_ops (819210e) to c6cc7f8 diff
BENCH_NAME=boolean_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench boolean_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=alamb_bitwise_ops
Results will be posted here when complete

@alamb-ghbot

🤖: Benchmark completed

Details

group         alamb_bitwise_ops                      main
-----         -----------------                      ----
and           3.00    828.3±5.53ns        ? ?/sec    1.00    276.4±1.71ns        ? ?/sec
and_sliced    1.10  1349.6±20.70ns        ? ?/sec    1.00  1230.0±36.69ns        ? ?/sec
not           1.87    402.3±4.99ns        ? ?/sec    1.00    214.9±1.25ns        ? ?/sec
not_sliced    1.11   777.0±10.46ns        ? ?/sec    1.00    701.4±6.43ns        ? ?/sec
or            3.30   821.4±24.65ns        ? ?/sec    1.00    249.0±0.79ns        ? ?/sec
or_sliced     1.23  1340.2±11.33ns        ? ?/sec    1.00   1093.7±9.34ns        ? ?/sec


alamb commented Dec 14, 2025

run benchmark boolean_kernels

@alamb-ghbot

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/bitwise_ops (c6a2e40) to c6cc7f8 diff
BENCH_NAME=boolean_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench boolean_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=alamb_bitwise_ops
Results will be posted here when complete

@alamb-ghbot

🤖: Benchmark completed

Details

group         alamb_bitwise_ops                      main
-----         -----------------                      ----
and           1.00    273.5±5.64ns        ? ?/sec    1.01    275.4±1.59ns        ? ?/sec
and_sliced    1.00  1027.1±17.88ns        ? ?/sec    1.20  1229.1±17.09ns        ? ?/sec
not           1.00    183.9±2.69ns        ? ?/sec    1.18    216.2±2.04ns        ? ?/sec
not_sliced    1.00    619.8±9.56ns        ? ?/sec    1.13    701.3±1.85ns        ? ?/sec
or            1.00    248.0±0.86ns        ? ?/sec    1.01    249.8±1.80ns        ? ?/sec
or_sliced     1.00   1023.1±5.07ns        ? ?/sec    1.07   1092.0±2.68ns        ? ?/sec


alamb commented Dec 14, 2025

run benchmark buffer_bit_ops


alamb commented Dec 14, 2025

run benchmark boolean_kernels

@alamb-ghbot

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/bitwise_ops (82bc7aa) to c6cc7f8 diff
BENCH_NAME=buffer_bit_ops
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench buffer_bit_ops
BENCH_FILTER=
BENCH_BRANCH_NAME=alamb_bitwise_ops
Results will be posted here when complete

@alamb-ghbot

🤖: Benchmark completed

Details

group                                alamb_bitwise_ops                      main
-----                                -----------------                      ----
buffer_binary_ops/and                1.00    216.0±1.37ns    66.2 GB/sec    1.22   263.6±11.48ns    54.3 GB/sec
buffer_binary_ops/and_with_offset    1.00   1234.9±3.79ns    11.6 GB/sec    1.21  1489.1±11.61ns     9.6 GB/sec
buffer_binary_ops/or                 1.00    211.3±1.96ns    67.7 GB/sec    1.21    255.2±3.15ns    56.1 GB/sec
buffer_binary_ops/or_with_offset     1.00   1268.1±6.32ns    11.3 GB/sec    1.04  1324.3±13.77ns    10.8 GB/sec
buffer_unary_ops/not                 1.00    182.7±1.70ns    52.2 GB/sec    1.24    226.5±1.22ns    42.1 GB/sec
buffer_unary_ops/not_with_offset     1.00   750.8±24.40ns    12.7 GB/sec    1.11    829.7±6.96ns    11.5 GB/sec

@alamb-ghbot

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/bitwise_ops (82bc7aa) to c6cc7f8 diff
BENCH_NAME=boolean_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench boolean_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=alamb_bitwise_ops
Results will be posted here when complete

@alamb-ghbot

🤖: Benchmark completed

Details

group         alamb_bitwise_ops                      main
-----         -----------------                      ----
and           1.00    211.3±5.58ns        ? ?/sec    1.30    274.8±1.53ns        ? ?/sec
and_sliced    1.00  1033.6±27.44ns        ? ?/sec    1.19   1225.3±5.24ns        ? ?/sec
not           1.00    146.4±0.60ns        ? ?/sec    1.47    215.2±4.93ns        ? ?/sec
not_sliced    1.00    621.0±3.26ns        ? ?/sec    1.13    700.8±7.63ns        ? ?/sec
or            1.00    200.8±0.91ns        ? ?/sec    1.24    249.1±1.57ns        ? ?/sec
or_sliced     1.00   1030.3±4.63ns        ? ?/sec    1.06  1093.6±11.06ns        ? ?/sec


alamb commented Dec 14, 2025

Might it also be because of the allocation? It looks like and_with_offset and and are not over power-of-two inputs.

You were spot on here @Dandandan -- getting rid of the extra allocation made a non-trivial difference in the benchmarks


alamb commented Dec 14, 2025

Update here: benchmarks are looking quite good 😎

I also incorporated the changes from #8807

My next plan is to:

  1. Add more unit tests / fuzzing
  2. Split it into two PRs to make it easier to review: unary and binary

Dandandan pushed a commit that referenced this pull request Dec 17, 2025
…unary` (#8996)

# Which issue does this PR close?


- part of #8806
- broken out from #8854


# Rationale for this change

The current implementation of the unary not kernel has an extra
allocation when operating on sliced data which is not necessary.

Also, we can generate more optimal code by processing u64 words at a
time when the buffer is already u64 aligned (see
#8807)

Also, it is hard to find the code to create new Buffers by copying bits

# What changes are included in this PR?

1. Introduce `BooleanBuffer::from_bitwise_unary` and
`BooleanBuffer::from_bits`
2. Deprecate `bitwise_unary_op_helper`

# Are these changes tested?

Yes with new tests and benchmarks

# Are there any user-facing changes?

New APIs

---------

Co-authored-by: Martin Hilton <[email protected]>
@Dandandan Dandandan closed this in 96637fc Jan 9, 2026
Dandandan pushed a commit to Dandandan/arrow-rs that referenced this pull request Jan 15, 2026
…er::from_bitwise_binary_op` (apache#9090)

# Which issue does this PR close?

- Part of apache#8806
- Closes apache#8854
- Closes apache#8807


This is the next step after
-  apache#8996

# Rationale for this change

- we can help rust / LLVM generate more optimal code by processing u64
words at a time when the buffer is already u64 aligned (see apache#8807)

Also, it is hard to find the code to create new Buffers by applying
bitwise unary operations.

# What changes are included in this PR?

- Introduce optimized `BooleanBuffer::from_bitwise_binary`
- Migrate several kernels that use `bitwise_bin_op_helper` to use the
new BooleanBuffer


# Are these changes tested?

Yes new tests are added

Performance results show 30% performance improvement for the `and` and
`or` kernels for aligned buffers (common case)

# Are there any user-facing changes?

A new API