
Conversation

@ohadravid (Contributor) commented May 18, 2025

Another small improvement I noticed while comparing perf to dav1d.

This improves the performance of `add_temporal_candidate` by ~20% (on an M3), so it now matches dav1d's (to within a 0.1% difference), as measured when decoding Chimera-AV1-8bit-1920x1080-6736kbps.ivf.

The overall improvement is about 0.5%.

Edit: also applied to `RefMvs{Mv,Ref}Pair`, which seems to have a smaller effect.
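
For context, a minimal sketch of the kind of change involved (simplified names, not the exact diff; the merged version ended up using a safe byte-wise comparison instead, as discussed below):

```rust
// Sketch only, not the exact diff: replace the derived field-by-field
// `PartialEq` on `Mv` with a single 32-bit comparison.
#[derive(Clone, Copy, Eq, Default)]
#[repr(C, align(4))]
pub struct Mv {
    pub y: i16,
    pub x: i16,
}

impl PartialEq for Mv {
    #[inline(always)]
    fn eq(&self, other: &Self) -> bool {
        // `Mv` is exactly 4 bytes with 4-byte alignment, so the
        // reinterpretation as `u32` is sound.
        let this: u32 = unsafe { std::mem::transmute::<Mv, u32>(*self) };
        let other: u32 = unsafe { std::mem::transmute::<Mv, u32>(*other) };
        this == other
    }
}
```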

@ohadravid ohadravid force-pushed the perf/mv-micro-optim branch from 29d1d51 to 04f430b Compare May 18, 2025 19:28
@kkysen kkysen self-requested a review May 19, 2025 00:07
@djc (Collaborator) left a comment

LGTM, but I'll let @kkysen take a look before merging.

@kkysen (Collaborator) left a comment

Some little nits about the comments.

@kkysen (Collaborator) left a comment

This change fundamentally looks good, correct, and better optimized, but I think it's worth investigating further. Thank you for finding this.

It appears Rust and LLVM optimize `#[derive(PartialEq)]` very poorly. It emits field-by-field comparisons as expected, but unexpectedly these aren't coalesced into a single comparison the way your manual impls do. I don't see any reason why this wouldn't be a valid optimization for rustc and LLVM to perform (and if it weren't, your manual optimization wouldn't be valid either).

Isolating this to just `Mv` in godbolt:

```rust
#[derive(Clone, Copy, PartialEq, Eq, Default)]
#[repr(C, align(4))]
pub struct Mv1 {
    pub y: i16,
    pub x: i16,
}

#[unsafe(no_mangle)]
pub fn f1(a: &Mv1, b: &Mv1) -> bool {
    a == b
}

#[derive(Clone, Copy, Eq, Default)]
#[repr(C, align(4))]
pub struct Mv2 {
    pub y: i16,
    pub x: i16,
}

impl PartialEq for Mv2 {
    #[inline(always)]
    fn eq(&self, other: &Self) -> bool {
        let this: u32 = unsafe { std::mem::transmute(*self) };
        let other: u32 = unsafe { std::mem::transmute(*other) };

        this == other
    }
}

#[unsafe(no_mangle)]
pub fn f2(a: &Mv2, b: &Mv2) -> bool {
    a == b
}
```

I get this asm on x86_64:

```asm
f1:
        movd    xmm0, dword ptr [rdi]
        movd    xmm1, dword ptr [rsi]
        pcmpeqw xmm1, xmm0
        pshuflw xmm0, xmm1, 80
        pshufd  xmm0, xmm0, 80
        movmskpd        eax, xmm0
        cmp     eax, 3
        sete    al
        ret

f2:
        mov     eax, dword ptr [rdi]
        cmp     eax, dword ptr [rsi]
        sete    al
        ret
```

and this on aarch64:

```asm
f1:
        ldrh    w8, [x0]
        ldrh    w9, [x1]
        ldrh    w10, [x0, #2]
        ldrh    w11, [x1, #2]
        cmp     w8, w9
        and     w8, w10, #0xffff
        and     w9, w11, #0xffff
        ccmp    w8, w9, #0, eq
        cset    w0, eq
        ret

f2:
        ldr     w8, [x0]
        ldr     w9, [x1]
        cmp     w8, w9
        cset    w0, eq
        ret
```

I think the aarch64 is a bit easier to read, and we can see it doing separate 16-bit loads and a masked-to-16-bits comparison. However, I don't see any reason why it couldn't be optimized as `f2` has been. One possible reason could be the alignment, but I tried adding `#[repr(align(4))]` to `Mv` and it didn't make a difference.[^1]

This seems to me like an optimization bug in rustc and/or LLVM, as it should have all the information needed to soundly optimize it, and this is something that comes up a lot in Rust. Should we open an issue?

Also, the fact that it's happening to these types here means we should probably look at all of our `PartialEq` impls and see how they're being optimized. It's very annoying that we have to do that, which is why I wish the compiler optimized better here.

This seems similar to #1332 in a way, just for PartialEq instead of Clone/Copy.

[^1]: Should we add this to your PR, too? Can't the 32-bit load be misaligned, which isn't always allowed? Does a C union with a `uint32_t` have the same alignment as a `uint32_t`? If that's the case (not completely sure off the top of my head), then we should add that to the Rust side (and anywhere else I did this kind of union transformation).
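
For what it's worth, a union's alignment is the maximum of its members' alignments (in both C and Rust), so a `u32` member does force 4-byte alignment. A minimal Rust sketch, with hypothetical names, that checks this:

```rust
// Hypothetical sketch: a #[repr(C)] union takes the alignment of its
// most-aligned field, so the `u32` member raises it to 4 even though
// `Mv` alone only needs 2-byte alignment.
#[derive(Clone, Copy)]
#[repr(C)]
#[allow(dead_code)]
struct Mv {
    y: i16,
    x: i16,
}

#[repr(C)]
#[allow(dead_code)]
union MvUnion {
    mv: Mv,
    bits: u32,
}

fn main() {
    assert_eq!(std::mem::align_of::<Mv>(), 2);
    assert_eq!(std::mem::align_of::<MvUnion>(), 4);
    assert_eq!(std::mem::align_of::<MvUnion>(), std::mem::align_of::<u32>());
}
```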

@kkysen (Collaborator) commented May 19, 2025

Look at this, too (godbolt):

```rust
pub fn f3(a: &[i16; 2], b: &[i16; 2]) -> bool {
    a == b
}

pub fn f4(a: &[i16; 2], b: &[i16; 2]) -> bool {
    a[0] == b[0] && a[1] == b[1]
}

pub fn f5(a: &(i16, i16), b: &(i16, i16)) -> bool {
    a == b
}
```

f3 (array aggregate comparison) is coalesced into a single comparison, but f4 (element-wise comparison with `&&`) and f5 (tuple aggregate comparison) are not.

@djc (Collaborator) commented May 19, 2025

Definitely worth creating an upstream issue, though it seems likely this is ultimately closer to LLVM's purview than rustc's.

@ohadravid ohadravid force-pushed the perf/mv-micro-optim branch from 04f430b to f535f15 Compare May 19, 2025 11:26
@ohadravid (Contributor, Author) commented May 19, 2025

@kkysen great writeup! I also tested a few other variants in godbolt (like `[self.x, self.y] == [other.x, other.y]`), but the only one that was optimized correctly was the transmute one. Maybe this is related to short-circuiting?
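
For reference, a sketch of that array-building variant (using the same simplified `Mv` as in the godbolt example above):

```rust
#[derive(Clone, Copy, Eq, Default)]
#[repr(C, align(4))]
pub struct Mv {
    pub y: i16,
    pub x: i16,
}

impl PartialEq for Mv {
    fn eq(&self, other: &Self) -> bool {
        // Builds two temporary `[i16; 2]` arrays and compares them.
        // Despite `f3` above coalescing, this form reportedly still
        // compiled to two 16-bit compares.
        [self.x, self.y] == [other.x, other.y]
    }
}
```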

> anywhere else I did this kind of union transformation

I did a quick look before opening the PR: a very similar thing is done in `LooprestorationParams`, which already has the `LooprestorationParamsSgr` Rust version that does this.

I think there are a few others used in `decode_b` which look more complex, and I don't know if they are used for comparisons in the same way. Might look at it later if I have the time 😄

@daxtens (Contributor) commented May 19, 2025

We can do this safely with the existing traits:

```rust
impl PartialEq for RefMvsMvPair {
    fn eq(&self, other: &Self) -> bool {
        self.as_bytes() == other.as_bytes()
    }
}
```

I'm not especially qualified to express an opinion on whether it should be optimised at the LLVM level. I note that in Compiler Explorer, this C code doesn't get optimised to a single compare in either gcc or clang; it's two compares in both.

```c
#include <stdint.h>
#include <stdbool.h>

struct a {
    int16_t x;
    int16_t y;
};

const struct a INVALID = {.x=-1, .y=-1};

bool is_invalid(struct a *a) {
    return a->x == INVALID.x && a->y == INVALID.y;
}
```

If we're not allowed to merge the compares in C, I'm guessing we're not allowed to merge them in a repr(C) struct either, but I couldn't even speculate as to the rationale.

@adrian17 commented May 19, 2025

This is better answered by actual Rust opsem gurus, but my understanding is that this is a rustc choice rather than an LLVM bug, and it comes from edge cases regarding uninitialized memory.

As in, in C it is possible to pass a `struct a *a` where `a->x` is initialized but `a->y` is uninitialized. In that case, comparing the fields separately doesn't always produce UB (as the branch comparing `a->y` might not be reached), while comparing them as a single u32 will always be UB, so it's illegal for the compiler to optimize one into the other.
rustc currently also doesn't tell LLVM that a `&T` reference is guaranteed to point to fully initialized data (if I manually add `!noundef` to the load instructions in the LLVM IR, alive2 confirms that only then is the optimization correct), and AFAIK it's still an ongoing question whether that's something they actually want to allow or not.

@kkysen (Collaborator) commented May 20, 2025

> We can do this safely with the existing traits:
>
> ```rust
> impl PartialEq for RefMvsMvPair {
>     fn eq(&self, other: &Self) -> bool {
>         self.as_bytes() == other.as_bytes()
>     }
> }
> ```

@daxtens, that's a cleaner way of doing it. Thanks!

@ohadravid, this would be preferable I think. Simpler and more generic.

@kkysen (Collaborator) commented May 20, 2025

> This is better answered by actual Rust opsem gurus, but my understanding is that this is a rustc choice rather than an LLVM bug, and it comes from edge cases regarding uninitialized memory.
>
> As in, in C it is possible to pass a `struct a *a` where `a->x` is initialized but `a->y` is uninitialized. In that case, comparing the fields separately doesn't always produce UB (as the branch comparing `a->y` might not be reached), while comparing them as a single u32 will always be UB, so it's illegal for the compiler to optimize one into the other. rustc currently also doesn't tell LLVM that a `&T` reference is guaranteed to point to fully initialized data (if I manually add `!noundef` to the load instructions in the LLVM IR, alive2 confirms that only then is the optimization correct), and AFAIK it's still an ongoing question whether that's something they actually want to allow or not.

Ahh, that makes a lot more sense now. I'm not sure why Rust wouldn't want to emit those `!noundef`s, though. Do you know where that discussion might be happening?

src/levels.rs (Outdated), comment on lines 397 to 402:

```rust
// In C, `mv` is a union of either two int16_t values or a uint32_t,
// with the uint32_t variant used for comparisons as it is faster.
let this: u32 = transmute!(*self);
let other: u32 = transmute!(*other);

this == other
```

@kkysen (Collaborator) left a suggested change:

```suggestion
self.as_bytes() == other.as_bytes()
```

Same for the others, as @daxtens suggested.

I'll think about what we should leave for a comment. I think you can remove the comments you left for now, as the issue is a bit more generic and subtle than that.

@ohadravid (Contributor, Author) replied:

@kkysen done, let me know how you want to document this. Maybe "Bytewise comparison optimizes better than per-field comparison"?

@kkysen (Collaborator) replied:

It seems to be mostly about not comparing per-field with `&&`. I'll give a suggestion in a new comment.

@adrian17 commented May 20, 2025

> Ahh, that makes a lot more sense now. I'm not sure why Rust wouldn't want to emit those `!noundef`s, though. Do you know where that discussion might be happening?

On review, I have been mostly wrong, sorry.

Looked some more into it (https://godbolt.org/z/vE6MK64Yf), and in this specific case Rust does initially add `!noundef` on all loads after all (a later SimplifyCfg pass removes some of them for valid reasons). It might instead be a case of the vectorizer being overly zealous...? In fact, LLVM does have a pass intended to merge comparisons (MergeICmps), but it doesn't look like it got to do anything here (maybe because the vectorizer ran first, but I was unable to get it to optimize the pre-vectorization IR either).

Either way, it looks like Rust already tracks an identical case under rust-lang/rust#140167.

@ohadravid (Contributor, Author) commented:

@kkysen squashed. Let me know how you want to document this. Maybe "Bytewise comparison optimizes better than per-field comparison"?

@ohadravid ohadravid requested a review from kkysen May 25, 2025 08:48
@ohadravid (Contributor, Author) commented:

@kkysen documented, aligned, squashed 😄

@ohadravid ohadravid force-pushed the perf/mv-micro-optim branch from b545fea to 3db6886 Compare May 25, 2025 08:49
@kkysen kkysen merged commit 1be76ea into memorysafety:main May 25, 2025
28 checks passed