@pitrou
Member

@pitrou pitrou commented May 31, 2018

Also a GenerateBitsUnrolled() for higher performance where warranted.

Benchmarks:

  • GenerateBits is 1.8x faster than BitmapWriter
  • GenerateBitsUnrolled is 2.9x faster than BitmapWriter
  • BooleanBuilder is now 3x faster than with BitmapWriter
    (and around 9x faster than it was with SetBitTo calls)
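For readers unfamiliar with the idea, here is a minimal sketch of a generator-driven bitmap writer. It is not the Arrow implementation: the name `GenerateBitsSketch`, its signature, and the choice to clear trailing bits in the last partial byte are all assumptions made for illustration. The point it tries to show is that bits are accumulated in a local byte and stored once per byte, rather than doing a read-modify-write on the bitmap for every bit.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch (not Arrow's code): writes `length` bits produced by the
// callable `g` into `bitmap`, starting at bit `start_offset`, LSB first.
// Assumption: bits after the written range in the last byte are cleared.
template <typename Generator>
void GenerateBitsSketch(uint8_t* bitmap, int64_t start_offset, int64_t length,
                        Generator&& g) {
  assert(bitmap != nullptr && start_offset >= 0 && length >= 0);
  if (length == 0) return;

  uint8_t* cur = bitmap + start_offset / 8;
  const unsigned first_bit = static_cast<unsigned>(start_offset % 8);
  uint8_t bit_mask = static_cast<uint8_t>(1u << first_bit);
  // Keep the bits that precede start_offset in the first byte.
  uint8_t current_byte = static_cast<uint8_t>(*cur & (bit_mask - 1));

  for (int64_t i = 0; i < length; ++i) {
    if (g()) current_byte = static_cast<uint8_t>(current_byte | bit_mask);
    bit_mask = static_cast<uint8_t>(bit_mask << 1);
    if (bit_mask == 0) {  // all eight bits filled: store once, start a new byte
      *cur++ = current_byte;
      current_byte = 0;
      bit_mask = 1;
    }
  }
  // Flush a final, partially filled byte if there is one.
  if (bit_mask != 1) {
    *cur = current_byte;
  }
}
```

A caller would pass a stateful lambda, for example one that walks an input array and returns `values[i++] != 0`.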

@codecov-io

Codecov Report

Merging #2093 into master will increase coverage by 0.01%.
The diff coverage is 97.95%.

@@            Coverage Diff             @@
##           master    #2093      +/-   ##
==========================================
+ Coverage   86.35%   86.36%   +0.01%     
==========================================
  Files         230      230              
  Lines       40392    40452      +60     
==========================================
+ Hits        34880    34937      +57     
- Misses       5512     5515       +3
Impacted Files Coverage Δ
cpp/src/arrow/compute/kernels/cast.cc 89.35% <100%> (-0.14%) ⬇️
cpp/src/arrow/builder.cc 81.79% <100%> (-0.43%) ⬇️
cpp/src/arrow/util/bit-util.h 98.5% <100%> (+0.49%) ⬆️
cpp/src/arrow/util/bit-util-test.cc 99.45% <94.28%> (-0.55%) ⬇️
cpp/src/arrow/util/thread-pool-test.cc 98.91% <0%> (-0.55%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d19089e...0ef0d12. Read the comment docs.

Member

@wesm wesm left a comment

+1, this is awesome! I'm OK to merge whenever you're satisfied enough with the code to remove the WIP.

bit_writer.Finish();
int64_t i = 0;
internal::GenerateBitsUnrolled(raw_data_, length_, length,
                               [values, &i]() -> bool { return values[i++] != 0; });
Member

I have often wondered if lambda functions have much overhead vs. inlined functions. Is there a good reference on how the various compilers behave?

Member

A bit of Googling suggests that in instances like this (where the type of the lambda is a template argument), the lambda will be inlined (https://www.quora.com/Are-C++-lambda-functions-always-inlined). If you passed a lambda into a function accepting a std::function of some kind, it wouldn't necessarily be inlined.

Member Author

Yes, apparently the recommended idiom is to let the callable argument be a template parameter so as to select a favorable specialization.
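To make the distinction in this thread concrete, here is a small sketch contrasting the two ways of accepting a callable. The function names `CountTrueTemplated` and `CountTrueErased` are hypothetical and exist only for this illustration. Whether a given compiler actually inlines either call depends on the compiler and optimization level, so treat the comments as the usual expectation rather than a guarantee.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// The callable's type is a template parameter: every distinct lambda yields
// its own instantiation, so the lambda body is visible at the call site and
// is typically easy for the optimizer to inline.
template <typename Generator>
int64_t CountTrueTemplated(int64_t n, Generator&& g) {
  assert(n >= 0);
  int64_t count = 0;
  for (int64_t i = 0; i < n; ++i) {
    count += g() ? 1 : 0;
  }
  return count;
}

// The callable is type-erased behind std::function: each call goes through an
// indirection that the optimizer may or may not be able to see through.
int64_t CountTrueErased(int64_t n, const std::function<bool()>& g) {
  assert(n >= 0);
  int64_t count = 0;
  for (int64_t i = 0; i < n; ++i) {
    count += g() ? 1 : 0;
  }
  return count;
}
```

Both functions accept the same stateful lambda, e.g. `[values, &i]() -> bool { return values[i++] != 0; }`, and return the same result; only the way the callable is passed differs.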

int64_t remaining_bytes = remaining / 8;
while (remaining_bytes-- > 0) {
  current_byte = 0;
  current_byte = g() ? current_byte | 0x01 : current_byte;
Member

Out of curiosity, would current_byte = current_byte | (0x01 * static_cast<uint8_t>(g())) have any better performance (to avoid branching)? I guess it's possible the compiler is doing some kind of optimization anyway.

Member Author

I've tried it quickly and, while the BooleanBuilder benchmark isn't affected, the bit-util microbenchmark became 2x faster. I'm wondering whether in this trivial case, perhaps the whole thing is SIMDed by the compiler. I should take a closer look.

(this is with gcc 4.9 on an AMD Ryzen)
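The two formulations discussed in this thread can be compared side by side with a sketch like the following. The helper names `PackBytesConditional` and `PackBytesBranchless` are hypothetical, and a loop over the eight bit positions stands in for the hand-unrolled statements of the excerpt under review. Both variants produce identical bytes; any speed difference depends on the compiler and target, as the measurements above suggest.

```cpp
#include <cassert>
#include <cstdint>

// Conditional form, as in the reviewed excerpt: select between the byte with
// the bit set and the unchanged byte.
template <typename Generator>
void PackBytesConditional(uint8_t* out, int64_t num_bytes, Generator&& g) {
  assert(out != nullptr && num_bytes >= 0);
  while (num_bytes-- > 0) {
    uint8_t current_byte = 0;
    for (int bit = 0; bit < 8; ++bit) {
      const uint8_t mask = static_cast<uint8_t>(1u << bit);
      current_byte = g() ? static_cast<uint8_t>(current_byte | mask) : current_byte;
    }
    *out++ = current_byte;
  }
}

// Arithmetic form, in the spirit of the suggestion: a bool converts to 0 or 1,
// so shifting it yields either 0 or the bit mask, with no conditional in the
// source.
template <typename Generator>
void PackBytesBranchless(uint8_t* out, int64_t num_bytes, Generator&& g) {
  assert(out != nullptr && num_bytes >= 0);
  while (num_bytes-- > 0) {
    uint8_t current_byte = 0;
    for (int bit = 0; bit < 8; ++bit) {
      const unsigned value = static_cast<unsigned>(g());
      current_byte = static_cast<uint8_t>(current_byte | (value << bit));
    }
    *out++ = current_byte;
  }
}
```

Inspecting the generated assembly for both variants would show whether the compiler already turns the conditional form into branch-free code.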

@wesm wesm changed the title [WIP] ARROW-2649: [C++] Add GenerateBits() function ARROW-2649: [C++] Add GenerateBits() function Jun 8, 2018
@wesm wesm changed the title ARROW-2649: [C++] Add GenerateBits() function ARROW-2649: [C++] Add GenerateBits() function to improve bitmap writing performance Jun 8, 2018
@wesm
Member

wesm commented Jun 8, 2018

+1, merging this. We can do further performance explorations in follow-up patches.

@wesm wesm closed this in 27b869a Jun 8, 2018
@pitrou pitrou deleted the ARROW-2649-generate-bits branch March 2, 2021 16:56