
@ohadravid
Contributor

Summary

There seems to be a small slowdown in some of the arm asm functions due to increased stack usage in the cdef::Fn functions, caused by some extra parameters that are only used for the safe Rust fallback.

I implemented this WIP fix which, while ugly, does seem to eliminate the slowdown.

Let me know if you think this is something that's worth pursuing 😄

Full details

Another small finding when comparing to dav1d:
the cdef_filter4_pri_edged_8bpc_neon asm function seems to be ~20% slower when called from rav1d.

Looking at the per-instruction sample count, there's one with a big, consistent diff:

dav1d: (per-instruction profiler screenshot)

rav1d: (per-instruction profiler screenshot)

From this, I tried to look at the callers. My arm assembly skills aren't good enough to analyze this properly, but comparing:

```asm
; dav1d's cdef_filter_4x4_neon
sub sp, sp, #0x150
stp x28, x27, [sp, #0xf0]
stp x26, x25, [sp, #0x100]
stp x24, x23, [sp, #0x110]
stp x22, x21, [sp, #0x120]
stp x20, x19, [sp, #0x130]
stp x29, x30, [sp, #0x140]
mov x19, x7
mov x20, x6
mov x21, x5
...
; rav1d's cdef_filter_neon_erased
stp x28, x27, [sp, #-0x60]!
stp x26, x25, [sp, #0x10]
stp x24, x23, [sp, #0x20]
stp x22, x21, [sp, #0x30]
stp x20, x19, [sp, #0x40]
stp x29, x30, [sp, #0x50]
add x29, sp, #0x50
sub sp, sp, #0x1a0
mov x19, x7
mov x20, x6
...
```

This seems to suggest that there's more stack spilling, which I think is due to the extra params in the cdef::Fn impl that are only used for the safe Rust fallback:

```rust
    _dst: *const FFISafe<Rav1dPictureDataComponentOffset>,
    _top: *const FFISafe<CdefTop>,
    _bottom: *const FFISafe<CdefBottom>,
```

This in turn pushes the data the asm function uses just "far enough" away to cause this slowdown.
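As a rough, self-contained illustration of why trailing parameters aren't free on AArch64 (stand-in types only, not the actual rav1d signatures): only the first eight integer/pointer arguments travel in registers, so the three fallback-only pointers all turn into extra stack traffic at every call site, even though the asm never reads them.

```rust
// Self-contained sketch with stand-in types; not the rav1d signatures.
#![allow(dead_code)]

#[repr(transparent)]
struct FfiSafePtr(*const ()); // stand-in for `*const FFISafe<...>`

// 11 args, mirroring what the asm entry point actually consumes: the first 8
// go in x0..x7, the rest already spill to the caller's stack.
extern "C" fn filter_like_asm(
    _a: usize, _b: usize, _c: usize, _d: usize,
    _e: usize, _f: usize, _g: usize, _h: usize,
    _i: usize, _j: usize, _k: usize,
) {
}

// Same, plus three fallback-only pointers: every call site now has to stage
// three more stack slots that the asm never looks at.
extern "C" fn filter_with_fallback_args(
    _a: usize, _b: usize, _c: usize, _d: usize,
    _e: usize, _f: usize, _g: usize, _h: usize,
    _i: usize, _j: usize, _k: usize,
    _dst: FfiSafePtr, _top: FfiSafePtr, _bottom: FfiSafePtr,
) {
}

fn main() {}
```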

With the changes in this PR, we get:

```asm
; rav1d::src::cdef::neon::cdef_filter_neon_erased
sub sp, sp, #0x140
stp x28, x27, [sp, #0xe0]
stp x26, x25, [sp, #0xf0]
stp x24, x23, [sp, #0x100]
stp x22, x21, [sp, #0x110]
stp x20, x19, [sp, #0x120]
stp x29, x30, [sp, #0x130]
add x29, sp, #0x130
mov x19, x7
mov x20, x6
mov x21, x5
mov x5, x4
mov x4, x3
...
```

and the diff in the profiler is almost gone.

@kkysen kkysen self-requested a review May 20, 2025 10:23
Collaborator

@kkysen kkysen left a comment


I'm not sure if this would work, but could we just remove the extra args from cdef_filter_neon_erased? That makes them like the asm functions, where we just declare them to have more args than they actually do/use. Would probably need to transmute the function pointers or something (which I'm not sure you can do/how to do it), but it's an idea.

Collaborator

@kkysen kkysen left a comment


Like this:

```diff
diff --git a/src/cdef.rs b/src/cdef.rs
index 464e9438..1740841a 100644
--- a/src/cdef.rs
+++ b/src/cdef.rs
@@ -21,6 +21,7 @@ use libc::ptrdiff_t;
 use std::cmp;
 use std::ffi::c_int;
 use std::ffi::c_uint;
+use std::mem;
 use std::ptr;

 #[cfg(all(
@@ -113,6 +114,25 @@ impl cdef::Fn {
             )
         }
     }
+
+    pub const fn new_neon(
+        fn_ptr: unsafe extern "C" fn(
+            dst_ptr: *mut DynPixel,
+            stride: ptrdiff_t,
+            left: *const [LeftPixelRow2px<DynPixel>; 8],
+            top_ptr: *const DynPixel,
+            bottom_ptr: *const DynPixel,
+            pri_strength: c_int,
+            sec_strength: c_int,
+            dir: c_int,
+            damping: c_int,
+            edges: CdefEdgeFlags,
+            bitdepth_max: c_int,
+            // No extra args, like the asm functions.
+        ) -> (),
+    ) -> Self {
+        Self::new(unsafe { mem::transmute(fn_ptr) })
+    }
 }

 wrap_fn_ptr!(pub unsafe extern "C" fn cdef_dir(
@@ -631,9 +651,6 @@ mod neon {
         damping: c_int,
         edges: CdefEdgeFlags,
         bitdepth_max: c_int,
-        _dst: *const FFISafe<Rav1dPictureDataComponentOffset>,
-        _top: *const FFISafe<CdefTop>,
-        _bottom: *const FFISafe<CdefBottom>,
     ) {
         use crate::src::align::Align16;

@@ -739,9 +756,10 @@ impl Rav1dCdefDSPContext {
         }

         self.dir = bd_fn!(cdef_dir::decl_fn, BD, cdef_find_dir, neon);
-        self.fb[0] = cdef::Fn::new(cdef_filter_neon_erased::<BD, 8, 8, 16, { 12 * 16 + 8 }>);
-        self.fb[1] = cdef::Fn::new(cdef_filter_neon_erased::<BD, 4, 8, 8, { 12 * 8 + 8 }>);
-        self.fb[2] = cdef::Fn::new(cdef_filter_neon_erased::<BD, 4, 4, 8, { 12 * 8 + 8 }>);
+
+        self.fb[0] = cdef::Fn::new_neon(cdef_filter_neon_erased::<BD, 8, 8, 16, { 12 * 16 + 8 }>);
+        self.fb[1] = cdef::Fn::new_neon(cdef_filter_neon_erased::<BD, 4, 8, 8, { 12 * 8 + 8 }>);
+        self.fb[2] = cdef::Fn::new_neon(cdef_filter_neon_erased::<BD, 4, 4, 8, { 12 * 8 + 8 }>);

         self
     }
```

@ohadravid

This comment was marked as resolved.

@ohadravid

This comment was marked as resolved.

@kkysen
Collaborator

kkysen commented May 26, 2025

I think (but am not sure) that it's unsound to transmute the function. While the calling convention might allow it, it is UB by the C standard: C11 §6.5.2.2 (Function calls) says

If the number of arguments does not equal the number of parameters, the behavior is undefined.

which should (?) apply in this case since we use extern "C".

I'm not sure if being extern "C" means the C standard applies to Rust. Does the Rustonomicon say anything about this? We do already do the same for asm functions (declare them in Rust with more args than they actually have in asm).

Collaborator

@kkysen kkysen left a comment


Let me know what you think - there's a new unsafe transmute &[LeftPixelRow2px<DynPixel>; 8] <-> &[LeftPixelRow2px<BD::Pixel>; 8] that's needed to make this work, and there might be a more correct "erased" version (but I think this is like the current *const version that uses .cast()).

This transmute seems totally fine, but in this version, how do you handle the asm fn versions without a wrapper function that adds overhead? (This is what we originally tried, albeit in a different module, and it had worse overhead than we expected.)
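For reference, a minimal sketch of what such a reference "un-erase" can look like, mirroring the existing pointer `.cast()` pattern (the type definitions below are simplified stand-ins, not the real rav1d ones):

```rust
// Simplified stand-ins; the real LeftPixelRow2px, DynPixel, and BD::Pixel live in rav1d.
#![allow(dead_code)]

#[repr(transparent)]
struct LeftPixelRow2px<T>([T; 2]);

enum DynPixel {} // erased pixel type, only ever used behind a pointer or reference

/// Turn an erased array reference back into a concrete one.
///
/// Safety: the caller must uphold the same layout compatibility that the
/// existing `*const ... .cast()` path already relies on.
unsafe fn unerase_left<Pixel>(
    left: &[LeftPixelRow2px<DynPixel>; 8],
) -> &[LeftPixelRow2px<Pixel>; 8] {
    unsafe { &*(left as *const [LeftPixelRow2px<DynPixel>; 8]).cast::<[LeftPixelRow2px<Pixel>; 8]>() }
}

fn main() {}
```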

@ohadravid
Contributor Author

Ran into a different problem: the transmute version from #1402 (review) seems to produce the correct asm for cdef_filter_neon_erased, but is still slow. Even worse, I get inconsistent perf when rebuilding this. This probably means the actual problem isn't just the extra args, but something else that sometimes trips the compiler.

I'll see if I can narrow it down somehow.

@ohadravid ohadravid changed the title [wip] arm: reduce stack pressure when calling cdef_filter functions arm: reduce stack pressure when calling cdef_filter functions May 27, 2025
Collaborator

@kkysen kkysen left a comment


Ran into a different problem: the transmute version from #1402 (review) seems to produce the correct asm for cdef_filter_neon_erased, but is still slow. Even worse, I get inconsistent perf when rebuilding this. This probably means the actual problem isn't just the extra args, but something else that sometimes trips the compiler.

I'll see if I can narrow it down somehow.

Hmm, that's a shame. Hope you can figure it out.

@ohadravid
Contributor Author

ohadravid commented May 27, 2025

I think I found it, and I have a partial solution.

TL;DR: there’s some extra stack usage in rav1d_cdef_brow, and this branch now shaves off about 200 ms from the 72.5 s on main. Didn't need the transmute in the end.

The change is to use borrows for the args to cdef::Fn::call.
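To make that concrete, here's a hypothetical sketch of the shape of this change (the names and field types are illustrative, not the real `cdef::Fn::call` signature):

```rust
// Illustrative stand-in for rav1d's WithOffset wrapper; not the real definition.
struct WithOffset<T> {
    data: T,
    offset: usize,
}

struct Filter;

impl Filter {
    // Before (sketch): taking the wrappers by value means the caller has to
    // materialize each aggregate, which can turn into a dedicated stack slot.
    fn call_by_value(&self, dst: WithOffset<Vec<u8>>, top: WithOffset<Vec<u8>>) -> usize {
        dst.data.len() + dst.offset + top.data.len() + top.offset
    }

    // After (sketch): borrowing the wrappers lets the caller reuse the values
    // it already has instead of copying them into fresh temporaries.
    fn call_by_ref(&self, dst: &WithOffset<Vec<u8>>, top: &WithOffset<Vec<u8>>) -> usize {
        dst.data.len() + dst.offset + top.data.len() + top.offset
    }
}

fn main() {
    let f = Filter;
    let dst = WithOffset { data: vec![0u8; 8], offset: 4 };
    let top = WithOffset { data: vec![0u8; 8], offset: 0 };
    let by_ref = f.call_by_ref(&dst, &top);
    let by_val = f.call_by_value(dst, top); // consumes the wrappers
    assert_eq!(by_ref, by_val);
}
```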

```bash
$ # release = 786bef5, release-baseline = main
$ hyperfine --warmup 2 --runs 10 --parameter-list profile release,release-baseline 'target/{profile}/dav1d -q -i ~/workspace/video_files_for_rav1d/Chimera-AV1-8bit-1920x1080-6736kbps.ivf -o /dev/null --threads 1'
Benchmark 1: target/release/dav1d -q -i ~/workspace/video_files_for_rav1d/Chimera-AV1-8bit-1920x1080-6736kbps.ivf -o /dev/null --threads 1
  Time (mean ± σ):     72.306 s ±  0.110 s    [User: 71.886 s, System: 0.266 s]
  Range (min … max):   72.158 s … 72.475 s    10 runs

Benchmark 2: target/release-baseline/dav1d -q -i ~/workspace/video_files_for_rav1d/Chimera-AV1-8bit-1920x1080-6736kbps.ivf -o /dev/null --threads 1
  Time (mean ± σ):     72.513 s ±  0.122 s    [User: 72.079 s, System: 0.260 s]
  Range (min … max):   72.333 s … 72.680 s    10 runs
```

Details

I’m comparing 3 different versions:

main, this branch, and a "remove-args-and-panic-in-Rust-fallback" branch.

Looking at the llvm IR for rav1d_cdef_brow, there’s something strange at the start of the function in main:

(`$ cargo asm -p rav1d-cli --bin dav1d --llvm rav1d_cdef_brow 0`)

```llvm
; rav1d::src::cdef_apply::rav1d_cdef_brow
; Function Attrs: nounwind
define internal fastcc void @rav1d::src::cdef_apply::rav1d_cdef_brow(..) {
  %10 = alloca [16 x i8], align 8
  %11 = alloca [16 x i8], align 8
  %12 = alloca [16 x i8], align 8
  %13 = alloca [16 x i8], align 8
  %14 = alloca [16 x i8], align 8
  %15 = alloca [16 x i8], align 8
  %16 = alloca [16 x i8], align 8
  %17 = alloca [24 x i8], align 8
  %18 = alloca [24 x i8], align 8
  %19 = alloca [4 x i8], align 4
  %20 = alloca [96 x i8], align 16
  %21 = icmp sgt i32 %5, 0
  %22 = select i1 %21, i32 12, i32 8
  %23 = load ptr, ptr %3, align 8
  ..
```

These allocas are related to WithOffset usage (the 16-byte allocas are probably for WithOffset<&'a DisjointMut<AlignedVec64<u8>>>, the 24-byte ones for WithOffset<PicOrBuf<'a, AlignedVec64<u8>>>).
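As a rough size check, assuming WithOffset is just a { data, offset } pair (the stand-in types below are not the real rav1d definitions): a thin reference plus a usize offset is 16 bytes, and a two-reference enum plus the offset typically comes out to 24 bytes under the default layout, matching the allocas above.

```rust
#![allow(dead_code)]

use std::mem::size_of;

// Assumed shape of the wrapper; the real definition in rav1d may differ.
struct WithOffset<T> {
    data: T,
    offset: usize,
}

// Stand-in for PicOrBuf: two reference-carrying variants, so the enum needs a
// separate tag (8-byte payload + tag, padded to 16 under the default layout).
enum PicOrBuf<'a> {
    Pic(&'a u32),
    Buf(&'a Vec<u8>),
}

fn main() {
    // 8 (reference) + 8 (offset) = 16 bytes, like the 16-byte allocas.
    println!("{}", size_of::<WithOffset<&Vec<u8>>>());
    // 16 (enum) + 8 (offset) = 24 bytes, like the 24-byte allocas.
    println!("{}", size_of::<WithOffset<PicOrBuf<'static>>>());
}
```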

In the no-args-panic version, we get:

```llvm
; rav1d::src::cdef_apply::rav1d_cdef_brow
; Function Attrs: nounwind
define internal fastcc void @rav1d::src::cdef_apply::rav1d_cdef_brow(..) {
  %10 = alloca [16 x i8], align 8
  %11 = alloca [4 x i8], align 4
  %12 = alloca [96 x i8], align 16
  %13 = icmp sgt i32 %5, 0
  %14 = select i1 %13, i32 12, i32 8
  %15 = load ptr, ptr %3, align 8
  ..
```

which is saner.

In this branch, we get:

```llvm
; rav1d::src::cdef_apply::rav1d_cdef_brow
; Function Attrs: nounwind
define internal fastcc void @rav1d::src::cdef_apply::rav1d_cdef_brow(..) {
  %10 = alloca [24 x i8], align 8
  %11 = alloca [16 x i8], align 8
  %12 = alloca [24 x i8], align 8
  %13 = alloca [16 x i8], align 8
  %14 = alloca [4 x i8], align 4
  %15 = alloca [48 x i8], align 8
  %16 = alloca [96 x i8], align 16
  %17 = icmp sgt i32 %5, 0
  %18 = select i1 %17, i32 12, i32 8
  %19 = load i64, ptr %3, align 8
  ..
```

which still doesn’t make a lot of sense, but seems to behave better w.r.t. the slow load in cdef_filter4_pri_edged_8bpc_neon:

(profiler screenshot: rav1d-cdef-borrow-asm)

The no-args-panic version is still faster (by ~400 ms), but I haven't managed to get the same results without killing the pure-Rust fallback:

```bash
$ # release = 5029b6b (perf/cdef-remove-args-and-panic-on-rust-fallback)
$ hyperfine --warmup 2 --runs 10 --parameter-list profile release,release-baseline 'target/{profile}/dav1d -q -i ~/workspace/video_files_for_rav1d/Chimera-AV1-8bit-1920x1080-6736kbps.ivf -o /dev/null --threads 1'
Benchmark 1: target/release/dav1d -q -i ~/workspace/video_files_for_rav1d/Chimera-AV1-8bit-1920x1080-6736kbps.ivf -o /dev/null --threads 1
  Time (mean ± σ):     71.883 s ±  0.347 s    [User: 71.462 s, System: 0.221 s]
  Range (min … max):   71.561 s … 72.337 s    10 runs
```

@workingjubilee

I'm not sure if being extern "C" means the C standard applies to Rust. Does the Rustonomicon say anything about this? We do already do the same for asm functions (declare them in Rust with more args than they actually have in asm).

Hello.

Misdeclaring an external function can cause UB, yes, especially if any link-time optimization is performed.

@workingjubilee

The documentation on fn types is not an exhaustive list of the requirements for declaring an external function and correctly calling them, but the rules on ABI compatibility are the minimum requirements... i.e. if you fall short of these then you already are going to have problems, and declaring an external function, linking against it, etc. will have more. And they say:

For two signatures to be considered ABI-compatible, they must use a compatible ABI string, must take the same number of arguments, and the individual argument types and the return types must be ABI-compatible.

@kkysen
Collaborator

kkysen commented May 27, 2025

Misdeclaring an external function can cause UB, yes, especially if any link-time optimization is performed.

What counts as misdeclaring an external function defined in assembly, though? There's no extern "C" definition there.

@workingjubilee

workingjubilee commented May 27, 2025

That is certainly an excellent question, to which the answer is "I'm not sure".

But for that case, my primary recommendation would be to avoid having incorrect declarations so that LLVM does not misoptimize around the function calls. Namely, don't declare arguments you don't actually care about.

@ohadravid
Contributor Author

ohadravid commented May 28, 2025

Ok @kkysen, I have another version that I think has the expected perf boost.

Take a look at this branch - I replaced all the WithOffset params in fn call<BD: BitDepth>(..) with plain data+offset pairs, which seems to allow LLVM to properly optimize everything away:

```llvm
; rav1d::src::cdef_apply::rav1d_cdef_brow
; Function Attrs: nounwind
define internal fastcc void @rav1d::src::cdef_apply::rav1d_cdef_brow(ptr nocapture noundef nonnull readonly align 64 %0, ptr noalias nocapture noundef nonnull align 64 dereferenceable(258624) %1, ptr noundef nonnull readonly align 16 %2, ptr noalias nocapture noundef nonnull readonly align 8 dereferenceable(48) %3, i32 noundef %4, i32 noundef %5, i32 noundef %6, i1 noundef zeroext %7, i32 noundef %8) unnamed_addr #1 personality ptr @rust_eh_personality {
  %10 = alloca [16 x i8], align 8
  %11 = alloca [4 x i8], align 4
  %12 = alloca [96 x i8], align 16
  %13 = icmp sgt i32 %5, 0
  %14 = select i1 %13, i32 12, i32 8
  %15 = load ptr, ptr %3, align 8
  ..
```
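For illustration, a hedged sketch of the kind of signature change described above (the real parameter types in rav1d are different; this only shows a WithOffset-style wrapper being split into separate data and offset arguments):

```rust
// Illustrative stand-in for the wrapper that was previously passed by value.
struct WithOffset<T> {
    data: T,
    offset: usize,
}

// Before (sketch): every call builds a WithOffset aggregate per argument,
// which is what the extra allocas in rav1d_cdef_brow corresponded to.
fn filter_with_wrapper(dst: WithOffset<&[u8]>) -> u8 {
    dst.data[dst.offset]
}

// After (sketch): pass the pieces separately; there is no aggregate for the
// caller to spill, which makes it easier for LLVM to keep things in registers.
fn filter_with_parts(dst: &[u8], dst_offset: usize) -> u8 {
    dst[dst_offset]
}

fn main() {
    let buf = [1u8, 2, 3, 4];
    assert_eq!(
        filter_with_wrapper(WithOffset { data: &buf[..], offset: 2 }),
        filter_with_parts(&buf, 2),
    );
}
```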

WDYT? It's a bit uglier, but if you think it's worth it I'll update this branch to it / open a new PR. I also wasn't sure where to document this in the code.
Overall perf boost is almost 0.5 sec vs main 🎉

Will update here with the final perf comparisons later. Done:

```bash
# no-offset = 3de88ba12915fc07dd311964345399b6a166a027, perf/improve-cdef-brow-stack-usage
#
% hyperfine --warmup 2 --runs 10 --parameter-list profile no-offset,fast-but-panic,release-baseline,use-borrows 'target/{profile}/dav1d -q -i ~/workspace/video_files_for_rav1d/Chimera-AV1-8bit-1920x1080-6736kbps.ivf -o /dev/null --threads 1'
Benchmark 1: target/no-offset/dav1d -q -i ~/workspace/video_files_for_rav1d/Chimera-AV1-8bit-1920x1080-6736kbps.ivf -o /dev/null --threads 1
  Time (mean ± σ):     72.044 s ±  0.074 s    [User: 71.598 s, System: 0.250 s]
  Range (min … max):   71.918 s … 72.152 s    10 runs

# fast-but-panic = 5029b6b, perf/cdef-remove-args-and-panic-on-rust-fallback
# < just for comparison. >
Benchmark 2: target/fast-but-panic/dav1d -q -i ~/workspace/video_files_for_rav1d/Chimera-AV1-8bit-1920x1080-6736kbps.ivf -o /dev/null --threads 1
  Time (mean ± σ):     72.057 s ±  0.059 s    [User: 71.613 s, System: 0.252 s]
  Range (min … max):   71.943 s … 72.149 s    10 runs

# main
#
Benchmark 3: target/release-baseline/dav1d -q -i ~/workspace/video_files_for_rav1d/Chimera-AV1-8bit-1920x1080-6736kbps.ivf -o /dev/null --threads 1
  Time (mean ± σ):     72.450 s ±  0.108 s    [User: 71.993 s, System: 0.255 s]
  Range (min … max):   72.344 s … 72.650 s    10 runs

# use-borrows = 786bef5f546713692d0ee04507853bb498b518e8, asm-stack-imp
# 
Benchmark 4: target/use-borrows/dav1d -q -i ~/workspace/video_files_for_rav1d/Chimera-AV1-8bit-1920x1080-6736kbps.ivf -o /dev/null --threads 1
  Time (mean ± σ):     72.295 s ±  0.081 s    [User: 71.840 s, System: 0.253 s]
  Range (min … max):   72.128 s … 72.410 s    10 runs
```

Collaborator

@kkysen kkysen left a comment


Take a look at this branch - I replaced all the WithOffset params in fn call<BD: BitDepth>(..) with plain data+offset pairs, which seems to allow LLVM to properly optimize everything away:

Can you open a PR for that so I can comment on it? It's very weird that LLVM optimizes it that much better when it's essentially just removing the struct WithOffset abstraction, but I guess LLVM will be LLVM.

kkysen added a commit that referenced this pull request Jun 14, 2025
* Supersedes #1402.

## Summary

There seems to be a slowdown in some of the Arm asm functions due to
increased stack usage in `rav1d_cdef_brow`
(`cdef_filter4_pri_edged_8bpc_neon` is ~20% slower when called from
rav1d).

By keeping `WithOffset`s "internal" to the function, LLVM is able to
optimize away most of the `alloca`-s (first commit).

After doing that, it seemed that `backup2x8` was still slower when
calling `.stride()` (vs. a version which doesn't have the Rust fallback
args _at all_), which I think is related to how LLVM optimizes the
entire `rav1d_cdef_brow` (second commit).

main:

(`$ cargo asm -p rav1d-cli --bin dav1d --llvm rav1d_cdef_brow 0`)

```llvm
; rav1d::src::cdef_apply::rav1d_cdef_brow
; Function Attrs: nounwind
define internal fastcc void @rav1d::src::cdef_apply::rav1d_cdef_brow(..) {
  %10 = alloca [16 x i8], align 8
  %11 = alloca [16 x i8], align 8
  %12 = alloca [16 x i8], align 8
  %13 = alloca [16 x i8], align 8
  %14 = alloca [16 x i8], align 8
  %15 = alloca [16 x i8], align 8
  %16 = alloca [16 x i8], align 8
  %17 = alloca [24 x i8], align 8
  %18 = alloca [24 x i8], align 8
  %19 = alloca [4 x i8], align 4
  %20 = alloca [96 x i8], align 16
  %21 = icmp sgt i32 %5, 0
  %22 = select i1 %21, i32 12, i32 8
  %23 = load ptr, ptr %3, align 8
  ..
```

This branch:
```llvm
define internal fastcc void @rav1d::src::cdef_apply::rav1d_cdef_brow(..) {
  %10 = alloca [16 x i8], align 8
  %11 = alloca [24 x i8], align 8
  %12 = alloca [24 x i8], align 8
  %13 = alloca [4 x i8], align 4
  %14 = alloca [96 x i8], align 16
```

~This branch:~
(This is the most "optimized" version in this sense, but I couldn't get
the same result with the current Rust/asm argument-arrangement requirements:)

```llvm
; rav1d::src::cdef_apply::rav1d_cdef_brow
; Function Attrs: nounwind
define internal fastcc void @rav1d::src::cdef_apply::rav1d_cdef_brow(..) {
  %10 = alloca [16 x i8], align 8
  %11 = alloca [4 x i8], align 4
  %12 = alloca [96 x i8], align 16
  %13 = icmp sgt i32 %5, 0
  %14 = select i1 %13, i32 12, i32 8
  %15 = load ptr, ptr %3, align 8
  ..
```


## Full details

When comparing to `dav1d`, the `cdef_filter4_pri_edged_8bpc_neon` asm
function seems to be ~20% slower when called from `rav1d`.

Looking at the per-instruction sample count, there's one with a big,
consistent diff:

dav1d:

![dav1d](https://github.com/user-attachments/assets/2bc606aa-f27e-4e09-9c45-403cf6fc780d)

rav1d:

![rav1d](https://github.com/user-attachments/assets/607c87b7-43c3-44d5-9e7e-2213d3dc4362)

From this, I tried to look at the callers. The problem _seems_ to be
that when the src buffer in `x13` is placed "far enough back" in the
stack, the load stalls.

This fix is inspired by what I saw in
rust-lang/rust#141649 - since `WithOffset`s
seemed to match the `%10,..,%18` `alloca`-s in the IR above, I tried to
force LLVM to do the right thing and optimize them away (I tried a few
other things, but this is the most effective one).


With this fix, the diff in the sample count is gone, with a nice
speedup:

``` bash
rav1d-pr % hyperfine --warmup 3 --runs 15 --parameter-list profile target/release/dav1d,target/release-baseline/dav1d  '{profile} -q -i ~/workspace/video_files_for_rav1d/Chimera-AV1-8bit-1920x1080-6736kbps.ivf -o /dev/null --threads 1'

Benchmark 1: target/release/dav1d -q -i Chimera-AV1-8bit-1920x1080-6736kbps.ivf -o /dev/null --threads 1
  Time (mean ± σ):     71.918 s ±  0.066 s    [User: 71.595 s, System: 0.233 s]
  Range (min … max):   71.812 s … 72.032 s    15 runs

Benchmark 2: target/release-baseline/dav1d -q -i Chimera-AV1-8bit-1920x1080-6736kbps.ivf -o /dev/null --threads 1
  Time (mean ± σ):     72.294 s ±  0.078 s    [User: 71.945 s, System: 0.235 s]
  Range (min … max):   72.183 s … 72.438 s    15 runs

release-baseline = 1be76ea
```

I'm not sure how much better this will be on x86, but it should still be
faster.
@ohadravid ohadravid closed this Jun 16, 2025
@ohadravid ohadravid deleted the asm-stack-imp branch June 16, 2025 19:36