-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
rounding shift rights should use rounding halving add #6494
Conversation
On x86 currently we lower cast<uint8_t>((cast<uint16_t>(x) + 8) / 16) to: cast<uint8_t>(shift_right(widening_add(x, 8), 4)) This compiles to 8 instructions on x86: Widen each half of the input vector, add 8 to each half-vector, shift each half-vector, then narrow each half-vector. First, this should have been a rounding_shift_right. Some patterns were missing in FindIntrinsics. Second, rounding_shift_right had suboptimal codegen in the case where the second arg is a positive const. On archs without a rounding shift right instruction you can further rewrite this to: shift_right(rounding_halving_add(x, 7), 3) which is just two instructions on x86.
Dillon, I have a question for you in the IRMatch.h code. I'm a bit confused about the types of the second arg to shifts. |
src/IRMatch.h
Outdated
// Assuming the args have the same type as the intrinsic is incorrect in | ||
// general. But for the intrinsics we can fold (just shifts), the LHS | ||
// has the same type as the intrinsic, and we can always treat the RHS | ||
// as a signed int, because we're using 64 bits for it. (TODO: Dillon, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think this is OK. Constants will be simplified to unsigned shifts if needed.
test/correctness/simd_op_check.cpp
Outdated
@@ -150,6 +150,13 @@ class SimdOpCheck : public SimdOpCheckTest { | |||
check("pavgb", 8 * w, u8((u16(u8_1) + u16(u8_2) + 1) >> 1)); | |||
check("pavgw", 4 * w, u16((u32(u16_1) + u32(u16_2) + 1) / 2)); | |||
check("pavgw", 4 * w, u16((u32(u16_1) + u32(u16_2) + 1) >> 1)); | |||
|
|||
// Rounding right shifts should also use pavg | |||
check("pavgb", 8 * w, u8((u16(u8_1) + 15) >> 4)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these added checks should be in test/correctness/intrinsics.cpp instead (much lighter weight test than simd_op_check and it is target independent).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a test that rounding_shift_right is both caught by pattern matching and also lowers correctly on x86, so I'll move half of it to intrinsics
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aren't the two tests below this the same test after accounting for the pattern matching part?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hopefully the latest commit makes it clearer. The first two tests aren't just a rounding shift right. they strength-reduce to an average followed by a shift right. So it tests the new patterns.
The second test checks lower_rounding_shift_right does the right thing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh right, I hadn't noticed the constant isn't a rounding_shift_right. Still, I think the change you just made is good (easier tests to debug if they fail).
rewrite(min(shift_right(widening_mul(x, y), z), upper), mul_shift_right(x, y, cast(unsigned_type, z)), is_x_same_uint && is_uint(z)) || | ||
rewrite(min(rounding_shift_right(widening_mul(x, y), z), upper), rounding_mul_shift_right(x, y, cast(unsigned_type, z)), is_x_same_uint && is_uint(z)) || | ||
if ( | ||
// Saturating patterns |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Side comment, the reformatting here makes this change a lot harder to review easily.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, may I suggest restoring original format and putting clang-format off
back in place? If we want to let these be 'naturally' reformatted then IMHO we should do that in a standalone PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will in future, but here I just manually reformatted a few surrounding lines because they were hard to modify. What I'd done happened to agree with clang-format, so I also removed those comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general rewrite rules are exempt from clang-format, but these particular ones use named intrinsics, so they get very long indeed.
On x86 currently we lower cast<uint8_t>((cast<uint16_t>(x) + 8) / 16)
to:
cast<uint8_t>(shift_right(widening_add(x, 8), 4))
This compiles to 8 instructions on x86: Widen each half of the input
vector, add 8 to each half-vector, shift each half-vector, then narrow
each half-vector.
First, this should have been a rounding_shift_right. Some patterns were
missing in FindIntrinsics.
Second, rounding_shift_right had suboptimal codegen in the case where
the second arg is a positive const. On archs without a rounding shift
right instruction you can further rewrite this to:
shift_right(rounding_halving_add(x, 7), 3)
which is just two instructions on x86.