-
Notifications
You must be signed in to change notification settings - Fork 58
perf: added an optimized impl of PartialEq to Mv
#1400
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
29d1d51 to
04f430b
Compare
djc
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, but I'll let @kkysen take a look before merging.
kkysen
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some little nits about the comments.
kkysen
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this change fundamentally looks good, correct, and more optimized, but I think it's worth investigating this further, and thank you for finding this.
It appears Rust and LLVM optimize #[derive(PartialEq)] very poorly. It emits field by field comparisons as expected, but unexpectedly these aren't coalesced into a single comparison as your manual impls are doing, but I don't see any reason why this wouldn't be a valid optimization rustc and LLVM could do (and if it wasn't, your manual optimization wouldn't be valid either).
Isolating this to just Mv in godbolt
#[derive(Clone, Copy, PartialEq, Eq, Default)]
#[repr(C, align(4))]
pub struct Mv1 {
pub y: i16,
pub x: i16,
}
#[unsafe(no_mangle)]
pub fn f1(a: &Mv1, b: &Mv1) -> bool {
a == b
}
impl PartialEq for Mv2 {
#[inline(always)]
fn eq(&self, other: &Self) -> bool {
let this: u32 = unsafe { std::mem::transmute(*self) };
let other: u32 = unsafe { std::mem::transmute(*other) };
this == other
}
}
#[derive(Clone, Copy, Eq, Default)]
#[repr(C, align(4))]
pub struct Mv2 {
pub y: i16,
pub x: i16,
}
#[unsafe(no_mangle)]
pub fn f2(a: &Mv2, b: &Mv2) -> bool {
a == b
}I get this asm on x86_64:
f1:
movd xmm0, dword ptr [rdi]
movd xmm1, dword ptr [rsi]
pcmpeqw xmm1, xmm0
pshuflw xmm0, xmm1, 80
pshufd xmm0, xmm0, 80
movmskpd eax, xmm0
cmp eax, 3
sete al
ret
f2:
mov eax, dword ptr [rdi]
cmp eax, dword ptr [rsi]
sete al
retand this on aarch64:
f1:
ldrh w8, [x0]
ldrh w9, [x1]
ldrh w10, [x0, #2]
ldrh w11, [x1, #2]
cmp w8, w9
and w8, w10, #0xffff
and w9, w11, #0xffff
ccmp w8, w9, #0, eq
cset w0, eq
ret
f2:
ldr w8, [x0]
ldr w9, [x1]
cmp w8, w9
cset w0, eq
retI think the aarch64 is a bit easier to read, and we can see it doing separate 16-bit loads and a masked-to-16-bits comparison. However, I don't see any reason why it couldn't be optimized as f2 has been. One possible reason could be the alignment, but I tried adding #[repr(align(4))] to Mv and it didn't make a difference[^1].
This seems to me like an optimization bug in rustc and/or LLVM, as it should have all the information needed to soundly optimize it, and this is something that comes up a lot in Rust. Should we open an issue?
Also, the fact that it's happening here to these types means we probably should look at all of our PartialEq impls and see how they're being optimized. Which is very annoying that we have to do that, and why I wish the compiler optimized better here.
This seems similar to #1332 in a way, just for PartialEq instead of Clone/Copy.
[^1] Should we add this to your PR, too? Can't the 32-bit load be misaligned, which isn't always allowed? Does a C union with a uint32_t have the same alignment as a uint32_t? If that's the case (not completely sure off the top of my head), then we should add that to the Rust (and anywhere else I did this kind of union transformation).
|
Look at this, too (godbolt): pub fn f3(a: &[i16; 2], b: &[i16; 2]) -> bool {
a == b
}
pub fn f4(a: &[i16; 2], b: &[i16; 2]) -> bool {
a[0] == b[0] && a[1] == b[1]
}
pub fn f5(a: &(i16, i16), b: &(i16, i16)) -> bool {
a == b
}
|
|
Definitely worth creating an upstream issue, though it seems likely this is eventually closer to LLVM's purview than rustc's. |
04f430b to
f535f15
Compare
|
@kkysen great writeup! I also tested a few other variants in godbolt (like
I did a quick look before opening the PR: I think there are a few others that are used in |
|
We can do this safely with the existing traits: impl PartialEq for RefMvsMvPair {
fn eq(&self, other: &Self) -> bool {
self.as_bytes() == other.as_bytes()
}
}I'm not especially qualified to express an opinion on whether it should be optimised at the LLVM level. I note that in compiler explorer, this C code doesn't get optimised to a single compare in either gcc or clang - it's two compares in both. #include <stdint.h>
#include <stdbool.h>
struct a {
int16_t x;
int16_t y;
};
const struct a INVALID = {.x=-1, .y=-1};
bool is_invalid(struct a *a) {
return a->x == INVALID.x && a->y == INVALID.y;
}If we're not allowed merge the compares in C, I'm guessing we're not allowed to merge the compares in a |
|
This is better answered by actual Rust opsem gurus, but my understanding is that this is a Rustc choice rather than LLVM bug, and it comes from edge cases regarding uninitialized memory. As in, in C it is possible to pass a |
@daxtens, that's a cleaner way of doing. Thanks! @ohadravid, this would be preferable I think. Simpler and more generic. |
Ahh, that makes a lot more sense now. I'm not sure why Rust wouldn't want to emit those |
src/levels.rs
Outdated
| // In C, `mv` is a union of either two int16_t values or a uint32_t, | ||
| // with the uint32_t variant used for comparisons as it is faster. | ||
| let this: u32 = transmute!(*self); | ||
| let other: u32 = transmute!(*other); | ||
|
|
||
| this == other |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| // In C, `mv` is a union of either two int16_t values or a uint32_t, | |
| // with the uint32_t variant used for comparisons as it is faster. | |
| let this: u32 = transmute!(*self); | |
| let other: u32 = transmute!(*other); | |
| this == other | |
| self.as_bytes() == other.as_bytes() |
Same for the others, as @daxtens suggested.
I'll think about what we should leave for a comment. I think you can remove the comments you left for now, as the issue is a bit more generic and subtle than that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kkysen done, let me know how you want to document this. Maybe "Bytewise comparison optimizes better than per-field comparison"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems to be mostly about not comparing per-field with &&. I'll give a suggestion in a new comment.
On review, I have been mostly wrong, sorry. Looked some more into it (https://godbolt.org/z/vE6MK64Yf), and in this specific case Rust does initially add Either way, looks like Rust already tracks an identical case under rust-lang/rust#140167 . |
f535f15 to
46b7deb
Compare
46b7deb to
b545fea
Compare
|
@kkysen squashed. Let me know how you want to document this. Maybe "Bytewise comparison optimizes better than per-field comparison"? |
src/levels.rs
Outdated
| // In C, `mv` is a union of either two int16_t values or a uint32_t, | ||
| // with the uint32_t variant used for comparisons as it is faster. | ||
| let this: u32 = transmute!(*self); | ||
| let other: u32 = transmute!(*other); | ||
|
|
||
| this == other |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems to be mostly about not comparing per-field with &&. I'll give a suggestion in a new comment.
|
@kkysen documented, aligned, squashed 😄 |
b545fea to
3db6886
Compare
Another small improvement I noticed while comparing perf to
dav1d.This improves the performance of
add_temporal_candidateby ~20% (on M3), so it now matches the one fromdav1d(to less than 0.1% diff), as measured when decoding Chimera-AV1-8bit-1920x1080-6736kbps.ivf.The overall improvement is about 0.5%.
Edit: also for
RefMvs{Mv,Ref}Pair, which seems to have a smaller effect.