Improve BlockEncoding null bits packing #26437
Is this PR about reducing the network overhead for nulls, or does it also improve performance due to autovectorization?
This PR doesn't change the serialized representation, and therefore has no impact on network overhead. The performance gain comes from better JIT code generation for the serialization loops, although not necessarily from autovectorization specifically.
@pettyjamesm since you've shared benchmark results, can you contribute the benchmark itself too?
int currentByte = 0;
for (int position = 0; position < (length & ~0b111); position += 8) {
    byte value = 0;
    value |= (isNull[position + offset] ? 1 : 0) << 7;
Is it faster than:
value |= block.isNull(position) ? 0b1000_0000 : 0;
value |= block.isNull(position + 1) ? 0b0100_0000 : 0;
value |= block.isNull(position + 2) ? 0b0010_0000 : 0;
value |= block.isNull(position + 3) ? 0b0001_0000 : 0;
value |= block.isNull(position + 4) ? 0b0000_1000 : 0;
value |= block.isNull(position + 5) ? 0b0000_0100 : 0;
value |= block.isNull(position + 6) ? 0b0000_0010 : 0;
value |= block.isNull(position + 7) ? 0b0000_0001 : 0;
Yes, the new version performs better (and more consistently) than the previous implementation across all null rates in the benchmark, including when branches are (nearly) perfectly predicted (null rates 0.01 and 0.99). The performance of the new version is nearly flat (with some variation coming from the logic that performs null suppression of the values array), whereas the previous version shows a typical "M"-shaped curve in which performance depends on the branch prediction rate. This suggests the new version is branchless, although unfortunately I don't have a JVM build with hsdis available to show the generated assembly and prove it.
The benchmark is already checked in as |
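For readers following the thread, here is a minimal sketch of the two packing styles being compared. It only illustrates that both styles produce the same most-significant-bit-first bytes, consistent with the statement that the serialized representation is unchanged. The class name, method names, and the handling of the trailing partial byte are assumptions made for illustration and are not taken from the PR. Whether the JIT actually emits branchless code for either form is not something this sketch demonstrates.

```java
import java.util.Arrays;

// Hypothetical illustration class; not part of the PR.
final class NullBitsPackingSketch
{
    private NullBitsPackingSketch() {}

    // Mask style, as in the reviewer's question: a conditional selects a mask per position.
    static byte[] packWithTernaryMasks(boolean[] isNull, int offset, int length)
    {
        byte[] packed = new byte[(length + 7) / 8];
        for (int position = 0; position < length; position++) {
            // Bit 7 holds the first position of each group of eight.
            int mask = 0b1000_0000 >>> (position & 0b111);
            packed[position >> 3] |= (byte) (isNull[position + offset] ? mask : 0);
        }
        return packed;
    }

    // Shift style, as in the diff: each flag becomes 0 or 1 and is shifted into place.
    static byte[] packWithShifts(boolean[] isNull, int offset, int length)
    {
        byte[] packed = new byte[(length + 7) / 8];
        int currentByte = 0;
        int position = 0;
        // Full groups of eight positions.
        for (; position < (length & ~0b111); position += 8) {
            int value = 0;
            value |= (isNull[position + offset] ? 1 : 0) << 7;
            value |= (isNull[position + offset + 1] ? 1 : 0) << 6;
            value |= (isNull[position + offset + 2] ? 1 : 0) << 5;
            value |= (isNull[position + offset + 3] ? 1 : 0) << 4;
            value |= (isNull[position + offset + 4] ? 1 : 0) << 3;
            value |= (isNull[position + offset + 5] ? 1 : 0) << 2;
            value |= (isNull[position + offset + 6] ? 1 : 0) << 1;
            value |= (isNull[position + offset + 7] ? 1 : 0);
            packed[currentByte++] = (byte) value;
        }
        // Assumed tail handling: remaining positions fill the high bits of a final byte.
        if ((length & 0b111) > 0) {
            int value = 0;
            int shift = 7;
            for (; position < length; position++) {
                value |= (isNull[position + offset] ? 1 : 0) << shift;
                shift--;
            }
            packed[currentByte] = (byte) value;
        }
        return packed;
    }

    public static void main(String[] args)
    {
        boolean[] isNull = {true, false, false, false, false, false, false, true, false, true};
        byte[] masks = packWithTernaryMasks(isNull, 0, isNull.length);
        byte[] shifts = packWithShifts(isNull, 0, isNull.length);
        System.out.println("same bytes: " + Arrays.equals(masks, shifts));
    }
}
```

Both methods take the raw `boolean[]` and an `offset`, mirroring the `isNull[position + offset]` access in the diff, so the two can be compared on identical input.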
Also improves null packing implementation to avoid conditional branches in the encoding logic.
Looks good
Description
Improve block serialization logic to prefer using primitive arrays over virtual calls when checking positions for nulls.
Benchmarks: BenchmarkBlockSerde.serialize*
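As a rough illustration of the description above, the sketch below contrasts checking nulls through an interface method with reading a primitive `boolean[]` directly. `NullableColumn`, `ArrayBackedColumn`, and both counting methods are hypothetical stand-ins invented for this example; they are not the project's block API, and the sketch makes no claim about how the PR obtains the underlying array.

```java
// Hypothetical illustration class; not part of the PR.
final class NullAccessSketch
{
    private NullAccessSketch() {}

    // Hypothetical abstraction: callers only see an interface method.
    interface NullableColumn
    {
        int getPositionCount();

        boolean isNull(int position);
    }

    // Hypothetical array-backed implementation that can also expose its raw storage.
    static final class ArrayBackedColumn
            implements NullableColumn
    {
        private final boolean[] isNull;
        private final int offset;
        private final int positionCount;

        ArrayBackedColumn(boolean[] isNull, int offset, int positionCount)
        {
            this.isNull = isNull;
            this.offset = offset;
            this.positionCount = positionCount;
        }

        @Override
        public int getPositionCount()
        {
            return positionCount;
        }

        @Override
        public boolean isNull(int position)
        {
            return isNull[position + offset];
        }

        boolean[] rawIsNull()
        {
            return isNull;
        }

        int rawOffset()
        {
            return offset;
        }
    }

    // One interface call per position.
    static int countNullsViaInterface(NullableColumn column)
    {
        int count = 0;
        for (int position = 0; position < column.getPositionCount(); position++) {
            count += column.isNull(position) ? 1 : 0;
        }
        return count;
    }

    // Direct array reads: the loop body contains no method call at all.
    static int countNullsViaArray(boolean[] isNull, int offset, int length)
    {
        int count = 0;
        for (int position = 0; position < length; position++) {
            count += isNull[position + offset] ? 1 : 0;
        }
        return count;
    }

    public static void main(String[] args)
    {
        ArrayBackedColumn column = new ArrayBackedColumn(new boolean[] {false, true, true, false, true}, 1, 3);
        System.out.println(countNullsViaInterface(column));
        System.out.println(countNullsViaArray(column.rawIsNull(), column.rawOffset(), column.getPositionCount()));
    }
}
```

Both methods return the same count; the difference is only in what the loop body asks the JIT to optimize.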
Release notes
(x) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text: