
Conversation

@ohadravid (Contributor) commented May 18, 2025

Another small improvement I noticed while comparing perf to dav1d.

This improves the performance of `add_temporal_candidate` by ~20% (on an M3), so it now matches dav1d's (to within a 0.1% difference), as measured when decoding Chimera-AV1-8bit-1920x1080-6736kbps.ivf.

The overall improvement is about 0.5%.

Edit: also applied to `RefMvs{Mv,Ref}Pair`, which seems to have a smaller effect.
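
For context, a minimal sketch of the kind of change involved (simplified names, not the exact diff; the merged version ended up using a safe byte-wise comparison instead, as discussed below):

```rust
// Sketch only, not the exact diff: replace the derived field-by-field
// `PartialEq` on `Mv` with a single 32-bit comparison.
#[derive(Clone, Copy, Eq, Default)]
#[repr(C, align(4))]
pub struct Mv {
    pub y: i16,
    pub x: i16,
}

impl PartialEq for Mv {
    #[inline(always)]
    fn eq(&self, other: &Self) -> bool {
        // `Mv` is exactly 4 bytes with 4-byte alignment, so the
        // reinterpretation as `u32` is sound.
        let this: u32 = unsafe { std::mem::transmute::<Mv, u32>(*self) };
        let other: u32 = unsafe { std::mem::transmute::<Mv, u32>(*other) };
        this == other
    }
}
```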

@ohadravid ohadravid force-pushed the perf/mv-micro-optim branch from 29d1d51 to 04f430b Compare May 18, 2025 19:28
@kkysen kkysen self-requested a review May 19, 2025 00:07
@djc (Collaborator) left a comment

LGTM, but I'll let @kkysen take a look before merging.

@kkysen (Collaborator) left a comment

Some little nits about the comments.

@kkysen (Collaborator) left a comment

This change fundamentally looks good, correct, and better optimized, but I think it's worth investigating further. Thank you for finding this.

It appears Rust and LLVM optimize `#[derive(PartialEq)]` very poorly. It emits field-by-field comparisons as expected, but unexpectedly these aren't coalesced into a single comparison the way your manual impls do. I don't see any reason why this wouldn't be a valid optimization for rustc and LLVM to perform (and if it weren't, your manual optimization wouldn't be valid either).

Isolating this to just `Mv` in godbolt:

```rust
#[derive(Clone, Copy, PartialEq, Eq, Default)]
#[repr(C, align(4))]
pub struct Mv1 {
    pub y: i16,
    pub x: i16,
}

#[unsafe(no_mangle)]
pub fn f1(a: &Mv1, b: &Mv1) -> bool {
    a == b
}

#[derive(Clone, Copy, Eq, Default)]
#[repr(C, align(4))]
pub struct Mv2 {
    pub y: i16,
    pub x: i16,
}

impl PartialEq for Mv2 {
    #[inline(always)]
    fn eq(&self, other: &Self) -> bool {
        let this: u32 = unsafe { std::mem::transmute(*self) };
        let other: u32 = unsafe { std::mem::transmute(*other) };

        this == other
    }
}

#[unsafe(no_mangle)]
pub fn f2(a: &Mv2, b: &Mv2) -> bool {
    a == b
}
```

I get this asm on x86_64:

```asm
f1:
        movd    xmm0, dword ptr [rdi]
        movd    xmm1, dword ptr [rsi]
        pcmpeqw xmm1, xmm0
        pshuflw xmm0, xmm1, 80
        pshufd  xmm0, xmm0, 80
        movmskpd        eax, xmm0
        cmp     eax, 3
        sete    al
        ret

f2:
        mov     eax, dword ptr [rdi]
        cmp     eax, dword ptr [rsi]
        sete    al
        ret
```

and this on aarch64:

```asm
f1:
        ldrh    w8, [x0]
        ldrh    w9, [x1]
        ldrh    w10, [x0, #2]
        ldrh    w11, [x1, #2]
        cmp     w8, w9
        and     w8, w10, #0xffff
        and     w9, w11, #0xffff
        ccmp    w8, w9, #0, eq
        cset    w0, eq
        ret

f2:
        ldr     w8, [x0]
        ldr     w9, [x1]
        cmp     w8, w9
        cset    w0, eq
        ret
```

I think the aarch64 is a bit easier to read, and we can see it doing separate 16-bit loads and a masked-to-16-bits comparison. However, I don't see any reason why it couldn't be optimized as `f2` has been. One possible reason could be the alignment, but I tried adding `#[repr(align(4))]` to `Mv` and it didn't make a difference.[^1]

This seems to me like an optimization bug in rustc and/or LLVM, as it should have all the information needed to soundly optimize it, and this is something that comes up a lot in Rust. Should we open an issue?

Also, the fact that it's happening to these types here means we should probably look at all of our `PartialEq` impls and see how they're being optimized. It's very annoying that we have to do that, which is why I wish the compiler optimized better here.

This seems similar to #1332 in a way, just for PartialEq instead of Clone/Copy.

[^1]: Should we add this to your PR, too? Can't the 32-bit load be misaligned, which isn't always allowed? Does a C union with a `uint32_t` have the same alignment as a `uint32_t`? If that's the case (not completely sure off the top of my head), then we should add that to the Rust side (and anywhere else I did this kind of union transformation).
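
For what it's worth, a union's alignment is the maximum of its members' alignments (in both C and Rust), so a `u32` member does force 4-byte alignment. A minimal Rust sketch, with hypothetical names, that checks this:

```rust
// Hypothetical sketch: a #[repr(C)] union takes the alignment of its
// most-aligned field, so the `u32` member raises it to 4 even though
// `Mv` alone only needs 2-byte alignment.
#[derive(Clone, Copy)]
#[repr(C)]
#[allow(dead_code)]
struct Mv {
    y: i16,
    x: i16,
}

#[repr(C)]
#[allow(dead_code)]
union MvUnion {
    mv: Mv,
    bits: u32,
}

fn main() {
    assert_eq!(std::mem::align_of::<Mv>(), 2);
    assert_eq!(std::mem::align_of::<MvUnion>(), 4);
    assert_eq!(std::mem::align_of::<MvUnion>(), std::mem::align_of::<u32>());
}
```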

@kkysen (Collaborator) commented May 19, 2025

Look at this, too (godbolt):

```rust
pub fn f3(a: &[i16; 2], b: &[i16; 2]) -> bool {
    a == b
}

pub fn f4(a: &[i16; 2], b: &[i16; 2]) -> bool {
    a[0] == b[0] && a[1] == b[1]
}

pub fn f5(a: &(i16, i16), b: &(i16, i16)) -> bool {
    a == b
}
```

f3 (array aggregate comparison) is coalesced into a single comparison, but f4 (element-wise comparison with `&&`) and f5 (tuple aggregate comparison) are not.

@djc (Collaborator) commented May 19, 2025

Definitely worth creating an upstream issue, though it seems likely this is ultimately closer to LLVM's purview than rustc's.

@ohadravid ohadravid force-pushed the perf/mv-micro-optim branch from 04f430b to f535f15 Compare May 19, 2025 11:26
@ohadravid (Contributor, Author) commented May 19, 2025

@kkysen great writeup! I also tested a few other variants in godbolt (like `[self.x, self.y] == [other.x, other.y]`), but the only one that was optimized correctly was the transmute one. Maybe this is related to short-circuiting?
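
For reference, a sketch of that array-building variant (using the same simplified `Mv` as in the godbolt example above):

```rust
#[derive(Clone, Copy, Eq, Default)]
#[repr(C, align(4))]
pub struct Mv {
    pub y: i16,
    pub x: i16,
}

impl PartialEq for Mv {
    fn eq(&self, other: &Self) -> bool {
        // Builds two temporary `[i16; 2]` arrays and compares them.
        // Despite `f3` above coalescing, this form reportedly still
        // compiled to two 16-bit compares.
        [self.x, self.y] == [other.x, other.y]
    }
}
```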

> anywhere else I did this kind of union transformation

I did a quick look before opening the PR: a very similar thing is done in `LooprestorationParams`, which already has the `LooprestorationParamsSgr` Rust version that does this.

I think there are a few others used in `decode_b` which look more complex, and I don't know if they are used for comparisons in the same way. Might look at it later if I have the time 😄

@daxtens (Contributor) commented May 19, 2025

We can do this safely with the existing traits:

```rust
impl PartialEq for RefMvsMvPair {
    fn eq(&self, other: &Self) -> bool {
        self.as_bytes() == other.as_bytes()
    }
}
```

I'm not especially qualified to express an opinion on whether it should be optimised at the LLVM level. I note that in Compiler Explorer, this C code doesn't get optimised to a single compare in either gcc or clang; it's two compares in both.

```c
#include <stdint.h>
#include <stdbool.h>

struct a {
    int16_t x;
    int16_t y;
};

const struct a INVALID = {.x=-1, .y=-1};

bool is_invalid(struct a *a) {
    return a->x == INVALID.x && a->y == INVALID.y;
}
```

If we're not allowed to merge the compares in C, I'm guessing we're not allowed to merge them in a repr(C) struct either, but I couldn't even speculate as to the rationale.

@adrian17 commented May 19, 2025

This is better answered by actual Rust opsem gurus, but my understanding is that this is a rustc choice rather than an LLVM bug, and it comes from edge cases regarding uninitialized memory.

As in, in C it is possible to pass a `struct a *a` where `a->x` is initialized but `a->y` is uninitialized. In that case, comparing the fields separately doesn't always produce UB (as the branch comparing `a->y` might not be reached), while comparing them as a single u32 will always be UB, so it's illegal for the compiler to optimize one into the other.
rustc currently also doesn't tell LLVM that a `&T` reference is guaranteed to point to fully initialized data (if I manually add `!noundef` to the load instructions in the LLVM IR, alive2 confirms that only then is the optimization correct), and AFAIK it's still an ongoing question whether that's something they actually want to allow or not.

@kkysen (Collaborator) commented May 20, 2025

> We can do this safely with the existing traits:
>
> ```rust
> impl PartialEq for RefMvsMvPair {
>     fn eq(&self, other: &Self) -> bool {
>         self.as_bytes() == other.as_bytes()
>     }
> }
> ```

@daxtens, that's a cleaner way of doing it. Thanks!

@ohadravid, this would be preferable I think. Simpler and more generic.

@kkysen (Collaborator) commented May 20, 2025

> This is better answered by actual Rust opsem gurus, but my understanding is that this is a rustc choice rather than an LLVM bug, and it comes from edge cases regarding uninitialized memory.
>
> As in, in C it is possible to pass a `struct a *a` where `a->x` is initialized but `a->y` is uninitialized. In that case, comparing the fields separately doesn't always produce UB (as the branch comparing `a->y` might not be reached), while comparing them as a single u32 will always be UB, so it's illegal for the compiler to optimize one into the other. rustc currently also doesn't tell LLVM that a `&T` reference is guaranteed to point to fully initialized data (if I manually add `!noundef` to the load instructions in the LLVM IR, alive2 confirms that only then is the optimization correct), and AFAIK it's still an ongoing question whether that's something they actually want to allow or not.

Ahh, that makes a lot more sense now. I'm not sure why Rust wouldn't want to emit those `!noundef`s, though. Do you know where that discussion might be happening?

src/levels.rs (Outdated), comment on lines 397 to 402:

```rust
// In C, `mv` is a union of either two int16_t values or a uint32_t,
// with the uint32_t variant used for comparisons as it is faster.
let this: u32 = transmute!(*self);
let other: u32 = transmute!(*other);

this == other
```

@kkysen (Collaborator) left a suggested change:

```suggestion
self.as_bytes() == other.as_bytes()
```

Same for the others, as @daxtens suggested.

I'll think about what we should leave for a comment. I think you can remove the comments you left for now, as the issue is a bit more generic and subtle than that.

@ohadravid (Contributor, Author) replied:

@kkysen done, let me know how you want to document this. Maybe "Bytewise comparison optimizes better than per-field comparison"?

@kkysen (Collaborator) replied:

It seems to be mostly about not comparing per-field with `&&`. I'll give a suggestion in a new comment.

@adrian17 commented May 20, 2025

> Ahh, that makes a lot more sense now. I'm not sure why Rust wouldn't want to emit those `!noundef`s, though. Do you know where that discussion might be happening?

On review, I have been mostly wrong, sorry.

Looked some more into it (https://godbolt.org/z/vE6MK64Yf), and in this specific case Rust does initially add `!noundef` on all loads after all (a later SimplifyCfg pass removes some of them for valid reasons). It might instead be a case of the vectorizer being overly zealous...? In fact, LLVM does have a pass intended to merge comparisons (MergeICmps), but it doesn't look like it got to do anything here (maybe because the vectorizer ran first, but I was unable to get it to optimize the pre-vectorization IR either).

Either way, it looks like Rust already tracks an identical case under rust-lang/rust#140167.

@ohadravid (Contributor, Author) commented:

@kkysen squashed. Let me know how you want to document this. Maybe "Bytewise comparison optimizes better than per-field comparison"?

@ohadravid ohadravid requested a review from kkysen May 25, 2025 08:48
@ohadravid (Contributor, Author) commented:

@kkysen documented, aligned, squashed 😄

@ohadravid ohadravid force-pushed the perf/mv-micro-optim branch from b545fea to 3db6886 Compare May 25, 2025 08:49
@kkysen kkysen merged commit 1be76ea into memorysafety:main May 25, 2025
28 checks passed