Avoid needless gather in fast_integer_divide lowering #6441
Conversation
fast_integer_divide did two lookups: one for a multiplier, and one for a shift. It turns out you can use count-leading-zeros to compute a workable shift instead of doing a lookup. This PR speeds up fast_integer_divide by roughly 70% in cases where the denominator varies across vector lanes, by avoiding one of the two expensive gathers.
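For context, here is a minimal scalar sketch of the idea. The review below references a helper named shift_for_denominator; the body given here is an assumption (using the GCC/Clang __builtin_clz intrinsic), on the premise that the table's shift column held floor(log2(d)). The payoff is that in vectorized code the shift becomes a cheap per-lane count-leading-zeros instead of a second gather.

#include <cassert>
#include <cstdint>

// Hypothetical sketch: recover the shift from the denominator with
// count-leading-zeros rather than a second table lookup. Assumes the
// shift column held floor(log2(d)); Halide's exact formula may differ.
int shift_for_denominator(uint32_t d) {
    assert(d >= 1);
    // 31 - clz(d) is the index of d's highest set bit, i.e. floor(log2(d)).
    return 31 - __builtin_clz(d);
}

int main() {
    assert(shift_for_denominator(1) == 0);
    assert(shift_for_denominator(2) == 1);
    assert(shift_for_denominator(3) == 1);
    assert(shift_for_denominator(255) == 7);
}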
src/FastIntegerDivide.cpp
Outdated
Buffer<uint8_t> integer_divide_table_u8() {
    static std::mutex initialize_lock;
    std::lock_guard<std::mutex> lock_guard(initialize_lock);
    {
-       static Buffer<uint8_t> im(256, 2);
+       static Buffer<uint8_t> im(256);
Unrelated: since C++11, initialization of function-local statics is guaranteed to be thread-safe, so we could rewrite this to be a bit terser, like so:
Buffer<uint8_t> integer_divide_table_u8() {
    static Buffer<uint8_t> im = []() {
        Buffer<uint8_t> im(256);
        for (uint32_t i = 0; i < 256; i++) {
            im(i) = table_runtime_u8[i][2];
            if (i > 1) {
                internal_assert(table_runtime_u8[i][3] == shift_for_denominator(i));
            }
        }
        return im;
    }();
    return im;
}
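(This leans on C++11 "magic statics": the initializer of a function-local static runs exactly once even under concurrent callers, which is what lets the explicit mutex go away.)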
Good point. Will fix.
No, that's compile-time code.