-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-2649: [C++] Add GenerateBits() function to improve bitmap writing performance #2093
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Also a GenerateBitsUnrolled() for higher performance where warranted. Benchmarks: - GenerateBits is 1.8x faster than BitmapWriter - GenerateBitsUnrolled is 2.9x faster than BitmapWriter - BooleanBuilder is now 3x faster than with BitmapWriter (and around 9x faster than it was with SetBitTo calls)
Codecov Report
@@ Coverage Diff @@
## master #2093 +/- ##
==========================================
+ Coverage 86.35% 86.36% +0.01%
==========================================
Files 230 230
Lines 40392 40452 +60
==========================================
+ Hits 34880 34937 +57
- Misses 5512 5515 +3
Continue to review full report at Codecov.
|
wesm
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, this is awesome! I'm OK to merge whenever you're satisfied with the code enough to remove the WIP
| bit_writer.Finish(); | ||
| int64_t i = 0; | ||
| internal::GenerateBitsUnrolled(raw_data_, length_, length, | ||
| [values, &i]() -> bool { return values[i++] != 0; }); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have often wondered if lambda functions have much overhead vs. inlined functions, is there a good reference on how the various compilers behave?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A bit of Googling suggests that in instances like this (where the type of the lambda is a template argument), the lambda will be inlined https://www.quora.com/Are-C++-lambda-functions-always-inlined). If you passed a lambda into a function accepting an std::function of some kind, it wouldn't be necessarily
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, apparently the recommended idiom is to let the callable argument be a template parameter so as to select a favorable specialization.
| int64_t remaining_bytes = remaining / 8; | ||
| while (remaining_bytes-- > 0) { | ||
| current_byte = 0; | ||
| current_byte = g() ? current_byte | 0x01 : current_byte; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Out of curiousity, would current_byte = current_byte | (0x01 * static_cast<uint8_t>(g())) have any better performance (to avoid branching)? I guess it's possible the compiler is doing some kind of optimization anyway
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've tried it quickly and, while the BooleanBuilder benchmark isn't affected, the bit-util microbenchmark became 2x faster. I'm wondering whether in this trivial case, perhaps the whole thing is SIMDed by the compiler. I should take a closer look.
(this is with gcc 4.9 on an AMD Ryzen)
|
+1, merging this. We can do further performance explorations in follow up patches |
Also a GenerateBitsUnrolled() for higher performance where warranted.
Benchmarks:
(and around 9x faster than it was with SetBitTo calls)