arm: reduce stack pressure when calling cdef_filter functions #1402
Conversation
I'm not sure if this would work, but could we just remove the extra args from `cdef_filter_neon_erased`? That would make them like the asm functions, where we just declare them to have more args than they actually do/use. We'd probably need to transmute the function pointers or something (which I'm not sure you can do, or how to do it), but it's an idea.
Like this:
diff --git a/src/cdef.rs b/src/cdef.rs
index 464e9438..1740841a 100644
--- a/src/cdef.rs
+++ b/src/cdef.rs
@@ -21,6 +21,7 @@ use libc::ptrdiff_t;
use std::cmp;
use std::ffi::c_int;
use std::ffi::c_uint;
+use std::mem;
use std::ptr;
#[cfg(all(
@@ -113,6 +114,25 @@ impl cdef::Fn {
)
}
}
+
+ pub const fn new_neon(
+ fn_ptr: unsafe extern "C" fn(
+ dst_ptr: *mut DynPixel,
+ stride: ptrdiff_t,
+ left: *const [LeftPixelRow2px<DynPixel>; 8],
+ top_ptr: *const DynPixel,
+ bottom_ptr: *const DynPixel,
+ pri_strength: c_int,
+ sec_strength: c_int,
+ dir: c_int,
+ damping: c_int,
+ edges: CdefEdgeFlags,
+ bitdepth_max: c_int,
+ // No extra args, like the asm functions.
+ ) -> (),
+ ) -> Self {
+ Self::new(unsafe { mem::transmute(fn_ptr) })
+ }
}
wrap_fn_ptr!(pub unsafe extern "C" fn cdef_dir(
@@ -631,9 +651,6 @@ mod neon {
damping: c_int,
edges: CdefEdgeFlags,
bitdepth_max: c_int,
- _dst: *const FFISafe<Rav1dPictureDataComponentOffset>,
- _top: *const FFISafe<CdefTop>,
- _bottom: *const FFISafe<CdefBottom>,
) {
use crate::src::align::Align16;
@@ -739,9 +756,10 @@ impl Rav1dCdefDSPContext {
}
self.dir = bd_fn!(cdef_dir::decl_fn, BD, cdef_find_dir, neon);
- self.fb[0] = cdef::Fn::new(cdef_filter_neon_erased::<BD, 8, 8, 16, { 12 * 16 + 8 }>);
- self.fb[1] = cdef::Fn::new(cdef_filter_neon_erased::<BD, 4, 8, 8, { 12 * 8 + 8 }>);
- self.fb[2] = cdef::Fn::new(cdef_filter_neon_erased::<BD, 4, 4, 8, { 12 * 8 + 8 }>);
+
+ self.fb[0] = cdef::Fn::new_neon(cdef_filter_neon_erased::<BD, 8, 8, 16, { 12 * 16 + 8 }>);
+ self.fb[1] = cdef::Fn::new_neon(cdef_filter_neon_erased::<BD, 4, 8, 8, { 12 * 8 + 8 }>);
+ self.fb[2] = cdef::Fn::new_neon(cdef_filter_neon_erased::<BD, 4, 4, 8, { 12 * 8 + 8 }>);
self
}
Let me know what you think - there's a new unsafe transmute, `&[LeftPixelRow2px<DynPixel>; 8] <-> &[LeftPixelRow2px<BD::Pixel>; 8]`, that's needed to make this work, and there might be a more correct "erased" version (but I think this is like the current `*const` version that uses `.cast()`).
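For context, here's a minimal sketch of the two conversions being compared; `DynPixel` and the pixel type below are simplified stand-ins rather than rav1d's actual definitions:

```rust
// Simplified stand-ins: `DynPixel` plays the role of the erased pixel type and
// `u16` stands in for `BD::Pixel`; the real definitions live in rav1d's code.
#[repr(transparent)]
struct LeftPixelRow2px<T>([T; 2]);
struct DynPixel(u8);

// Raw-pointer form (like the existing `*const` version): `.cast()` is enough.
unsafe fn left_as_typed_ptr(
    left: *const [LeftPixelRow2px<DynPixel>; 8],
) -> *const [LeftPixelRow2px<u16>; 8] {
    left.cast()
}

// Reference form (the new transmute discussed above): the element type changes
// but the reference stays a thin pointer, so it is transmuted directly.
unsafe fn left_as_typed_ref(
    left: &[LeftPixelRow2px<DynPixel>; 8],
) -> &[LeftPixelRow2px<u16>; 8] {
    unsafe { std::mem::transmute(left) }
}
```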
This transmute seems totally fine, but in this version, how do you handle the asm fn versions without a wrapper function that adds overhead? (This is what we originally tried, albeit in a different module, and it had worse overhead than we thought.)
Ran into a different problem: the transmute version from #1402 (review) seems to produce the correct asm for `cdef_filter_neon_erased`, but is still slow. Even worse, I get inconsistent perf when rebuilding this. That probably means the actual problem isn't just the extra args, but something else that sometimes trips the compiler. I'll see if I can narrow it down somehow.
Hmm, that's a shame. Hope you can figure it out.
I think I found it, and I have a partial solution. TL;DR: there's some extra stack usage in `rav1d_cdef_brow`. The change is to use borrows for the args to the cdef filter functions.

$ # release = 786bef5, release-baseline = main
$ hyperfine --warmup 2 --runs 10 --parameter-list profile release,release-baseline 'target/{profile}/dav1d -q -i ~/workspace/video_files_for_rav1d/Chimera-AV1-8bit-1920x1080-6736kbps.ivf -o /dev/null --threads 1'
Benchmark 1: target/release/dav1d -q -i ~/workspace/video_files_for_rav1d/Chimera-AV1-8bit-1920x1080-6736kbps.ivf -o /dev/null --threads 1
Time (mean ± σ): 72.306 s ± 0.110 s [User: 71.886 s, System: 0.266 s]
Range (min … max): 72.158 s … 72.475 s 10 runs
Benchmark 2: target/release-baseline/dav1d -q -i ~/workspace/video_files_for_rav1d/Chimera-AV1-8bit-1920x1080-6736kbps.ivf -o /dev/null --threads 1
Time (mean ± σ): 72.513 s ± 0.122 s [User: 72.079 s, System: 0.260 s]
Range (min … max): 72.333 s … 72.680 s 10 runs

Details

I'm comparing 3 different versions: main, this branch, and a "remove-args-and-panic-in-Rust-fallback" branch. Looking at the llvm IR for `rav1d_cdef_brow`, main gives:

; rav1d::src::cdef_apply::rav1d_cdef_brow
; Function Attrs: nounwind
define internal fastcc void @rav1d::src::cdef_apply::rav1d_cdef_brow(..) {
%10 = alloca [16 x i8], align 8
%11 = alloca [16 x i8], align 8
%12 = alloca [16 x i8], align 8
%13 = alloca [16 x i8], align 8
%14 = alloca [16 x i8], align 8
%15 = alloca [16 x i8], align 8
%16 = alloca [16 x i8], align 8
%17 = alloca [24 x i8], align 8
%18 = alloca [24 x i8], align 8
%19 = alloca [4 x i8], align 4
%20 = alloca [96 x i8], align 16
%21 = icmp sgt i32 %5, 0
%22 = select i1 %21, i32 12, i32 8
%23 = load ptr, ptr %3, align 8
..

In the no-args-panic version, we get:

; rav1d::src::cdef_apply::rav1d_cdef_brow
; Function Attrs: nounwind
define internal fastcc void @rav1d::src::cdef_apply::rav1d_cdef_brow(..) {
%10 = alloca [16 x i8], align 8
%11 = alloca [4 x i8], align 4
%12 = alloca [96 x i8], align 16
%13 = icmp sgt i32 %5, 0
%14 = select i1 %13, i32 12, i32 8
%15 = load ptr, ptr %3, align 8
..

which is saner. In this branch, we get:

; rav1d::src::cdef_apply::rav1d_cdef_brow
; Function Attrs: nounwind
define internal fastcc void @rav1d::src::cdef_apply::rav1d_cdef_brow(..) {
%10 = alloca [24 x i8], align 8
%11 = alloca [16 x i8], align 8
%12 = alloca [24 x i8], align 8
%13 = alloca [16 x i8], align 8
%14 = alloca [4 x i8], align 4
%15 = alloca [48 x i8], align 8
%16 = alloca [96 x i8], align 16
%17 = icmp sgt i32 %5, 0
%18 = select i1 %17, i32 12, i32 8
%19 = load i64, ptr %3, align 8
..

which still doesn't make a lot of sense, but seems to behave better w.r.t. the slow load in `cdef_filter4_pri_edged_8bpc_neon`. The no-args-panic version is still faster (by ~400ms), but I haven't managed to get the same results without killing the pure-Rust fallback.
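To make the "use borrows" change concrete, here is a tiny illustrative sketch; the argument types are hypothetical stand-ins, not the actual `WithOffset`-based signatures in rav1d:

```rust
// Hypothetical argument bundles standing in for rav1d's `WithOffset` wrappers.
struct Dst {
    ptr: *mut u8,
    offset: usize,
}
struct Top {
    ptr: *const u8,
    offset: usize,
}

// Before: the wrappers are passed by value, so each call site owns a copy.
fn filter_by_value(dst: Dst, top: Top, stride: isize) {
    let _ = (dst.ptr, dst.offset, top.ptr, top.offset, stride);
}

// After ("use borrows"): the same wrappers are passed as borrows instead.
fn filter_by_borrow(dst: &Dst, top: &Top, stride: isize) {
    let _ = (dst.ptr, dst.offset, top.ptr, top.offset, stride);
}
```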
Hello. Misdeclaring an external function can cause UB, yes, especially if any link-time optimization is performed.
The documentation on
What counts as misdeclaring an external function defined in assembly, though? There's no
That is certainly an excellent question, to which the answer is "I'm not sure". But for that case, my primary recommendation would be to avoid having incorrect declarations so that LLVM does not misoptimize around the function calls. Namely, don't declare arguments you don't actually care about.
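As a concrete illustration of that advice (the symbol name and signature below are hypothetical, not rav1d's actual asm interface):

```rust
use std::ffi::c_int;

extern "C" {
    // Declare exactly the arguments the assembly routine reads, nothing more.
    fn cdef_filter_block_neon(dst: *mut u8, stride: isize, pri_strength: c_int);

    // What to avoid: trailing Rust-only arguments that the asm never touches,
    // e.g. an extra `_fallback_state: *const u8` appended to the same signature.
}
```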
Ok @kkysen, I have another version that I think has the expected perf boost. Take a look at this branch - I replaced all the `WithOffset` params in `fn call<BD: BitDepth>(..)` to be data+offset, which seems to allow LLVM to properly optimize away everything:

; rav1d::src::cdef_apply::rav1d_cdef_brow
; Function Attrs: nounwind
define internal fastcc void @rav1d::src::cdef_apply::rav1d_cdef_brow(ptr nocapture noundef nonnull readonly align 64 %0, ptr noalias nocapture noundef nonnull align 64 dereferenceable(258624) %1, ptr noundef nonnull readonly align 16 %2, ptr noalias nocapture noundef nonnull readonly align 8 dereferenceable(48) %3, i32 noundef %4, i32 noundef %5, i32 noundef %6, i1 noundef zeroext %7, i32 noundef %8) unnamed_addr #1 personality ptr @rust_eh_personality {
%10 = alloca [16 x i8], align 8
%11 = alloca [4 x i8], align 4
%12 = alloca [96 x i8], align 16
%13 = icmp sgt i32 %5, 0
%14 = select i1 %13, i32 12, i32 8
%15 = load ptr, ptr %3, align 8
..

WDYT? It's a bit uglier, but if you think it's worth it I'll update this branch to it / open a new PR. I also wasn't sure where to document this in the code.
# no-offset = 3de88ba12915fc07dd311964345399b6a166a027, perf/improve-cdef-brow-stack-usage
#
% hyperfine --warmup 2 --runs 10 --parameter-list profile no-offset,fast-but-panic,release-baseline,use-borrows 'target/{profile}/dav1d -q -i ~/workspace/video_files_for_rav1d/Chimera-AV1-8bit-1920x1080-6736kbps.ivf -o /dev/null --threads 1'
Benchmark 1: target/no-offset/dav1d -q -i ~/workspace/video_files_for_rav1d/Chimera-AV1-8bit-1920x1080-6736kbps.ivf -o /dev/null --threads 1
Time (mean ± σ): 72.044 s ± 0.074 s [User: 71.598 s, System: 0.250 s]
Range (min … max): 71.918 s … 72.152 s 10 runs
# fast-but-panic = 5029b6b, perf/cdef-remove-args-and-panic-on-rust-fallback
# < just for comparison. >
Benchmark 2: target/fast-but-panic/dav1d -q -i ~/workspace/video_files_for_rav1d/Chimera-AV1-8bit-1920x1080-6736kbps.ivf -o /dev/null --threads 1
Time (mean ± σ): 72.057 s ± 0.059 s [User: 71.613 s, System: 0.252 s]
Range (min … max): 71.943 s … 72.149 s 10 runs
# main
#
Benchmark 3: target/release-baseline/dav1d -q -i ~/workspace/video_files_for_rav1d/Chimera-AV1-8bit-1920x1080-6736kbps.ivf -o /dev/null --threads 1
Time (mean ± σ): 72.450 s ± 0.108 s [User: 71.993 s, System: 0.255 s]
Range (min … max): 72.344 s … 72.650 s 10 runs
# use-borrows = 786bef5f546713692d0ee04507853bb498b518e8, asm-stack-imp
#
Benchmark 4: target/use-borrows/dav1d -q -i ~/workspace/video_files_for_rav1d/Chimera-AV1-8bit-1920x1080-6736kbps.ivf -o /dev/null --threads 1
Time (mean ± σ): 72.295 s ± 0.081 s [User: 71.840 s, System: 0.253 s]
Range (min … max): 72.128 s … 72.410 s 10 runs
Take a look at this branch - I replaced all the `WithOffset` params in `fn call<BD: BitDepth>(..)` to be data+offset, which seems to allow LLVM to properly optimize away everything:
Can you open a PR for that so I can comment on it? It's very weird that LLVM optimizes it that much better when it's essentially just removing the struct WithOffset abstraction, but I guess LLVM will be LLVM.
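For reference, a rough sketch of the signature change being discussed; the types and function names here are hypothetical, not rav1d's real `WithOffset` or `fn call` definitions:

```rust
// Hypothetical stand-in for the `WithOffset` wrapper.
struct WithOffset<T> {
    data: T,
    offset: usize,
}

// Before: the wrapper is passed as an aggregate argument.
fn call_with_struct(dst: WithOffset<*mut u8>, stride: isize) {
    let _ = (dst.data, dst.offset, stride);
}

// After: the same information split into plain data + offset parameters,
// which (per the discussion above) LLVM optimized much more effectively here.
fn call_with_scalars(dst_data: *mut u8, dst_offset: usize, stride: isize) {
    let _ = (dst_data, dst_offset, stride);
}
```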
* Supersedes #1402.

## Summary

There seems to be a slowdown in some of the Arm asm functions due to increased stack usage in `rav1d_cdef_brow` (`cdef_filter4_pri_edged_8bpc_neon` is ~20% slower when called from rav1d). By keeping `WithOffset`s "internal" to the function, LLVM is able to optimize away most of the `alloca`-s (first commit).

After doing that, it seemed that `backup2x8` was still slower when calling `.stride()` (vs. a version which doesn't have the Rust fallback args _at all_), which I think is related to how LLVM optimizes the entire `rav1d_cdef_brow` (second commit).

main: (`$ cargo asm -p rav1d-cli --bin dav1d --llvm rav1d_cdef_brow 0`)

```llvm
; rav1d::src::cdef_apply::rav1d_cdef_brow
; Function Attrs: nounwind
define internal fastcc void @rav1d::src::cdef_apply::rav1d_cdef_brow(..) {
%10 = alloca [16 x i8], align 8
%11 = alloca [16 x i8], align 8
%12 = alloca [16 x i8], align 8
%13 = alloca [16 x i8], align 8
%14 = alloca [16 x i8], align 8
%15 = alloca [16 x i8], align 8
%16 = alloca [16 x i8], align 8
%17 = alloca [24 x i8], align 8
%18 = alloca [24 x i8], align 8
%19 = alloca [4 x i8], align 4
%20 = alloca [96 x i8], align 16
%21 = icmp sgt i32 %5, 0
%22 = select i1 %21, i32 12, i32 8
%23 = load ptr, ptr %3, align 8
..
```

This branch:

```llvm
define internal fastcc void @rav1d::src::cdef_apply::rav1d_cdef_brow(..) {
%10 = alloca [16 x i8], align 8
%11 = alloca [24 x i8], align 8
%12 = alloca [24 x i8], align 8
%13 = alloca [4 x i8], align 4
%14 = alloca [96 x i8], align 16
```

~~This branch:~~ (This is the most "optimized" version in this sense, but I couldn't get the same with the current Rust/asm argument arrangement requirements:)

```llvm
; rav1d::src::cdef_apply::rav1d_cdef_brow
; Function Attrs: nounwind
define internal fastcc void @rav1d::src::cdef_apply::rav1d_cdef_brow(..) {
%10 = alloca [16 x i8], align 8
%11 = alloca [4 x i8], align 4
%12 = alloca [96 x i8], align 16
%13 = icmp sgt i32 %5, 0
%14 = select i1 %13, i32 12, i32 8
%15 = load ptr, ptr %3, align 8
..
```

## Full details

When comparing to `dav1d`, the `cdef_filter4_pri_edged_8bpc_neon` asm function seems to be ~20% slower when called from `rav1d`. Looking at the per-instruction sample count, there's one with a big, consistent diff:

dav1d:



rav1d:



From this, I tried to look at the callers. The problem _seems_ to be that when the src buffer in `x13` is placed "far enough back" in the stack, the load stalls.

This fix is inspired by what I saw in rust-lang/rust#141649 - since `WithOffset`s seemed to match the `%10,..,%18` `alloca`-s in the IR above, I tried to force LLVM to do the right thing and optimize them away (I tried a few other things, but this is the most effective one).


With this fix, the diff in the sample count is gone, with a nice speedup:

```bash
rav1d-pr % hyperfine --warmup 3 --runs 15 --parameter-list profile target/release/dav1d,target/release-baseline/dav1d '{profile} -q -i ~/workspace/video_files_for_rav1d/Chimera-AV1-8bit-1920x1080-6736kbps.ivf -o /dev/null --threads 1'
Benchmark 1: target/release/dav1d -q -i Chimera-AV1-8bit-1920x1080-6736kbps.ivf -o /dev/null --threads 1
  Time (mean ± σ):     71.918 s ±  0.066 s    [User: 71.595 s, System: 0.233 s]
  Range (min … max):   71.812 s … 72.032 s    15 runs

Benchmark 2: target/release-baseline/dav1d -q -i Chimera-AV1-8bit-1920x1080-6736kbps.ivf -o /dev/null --threads 1
  Time (mean ± σ):     72.294 s ±  0.078 s    [User: 71.945 s, System: 0.235 s]
  Range (min … max):   72.183 s … 72.438 s    15 runs

release-baseline = 1be76ea
```

I'm not sure how much better this will be on x86, but it should still be faster.

Summary
There seems to be a small slowdown in some of the arm asm functions due to increased stack usage in the `cdef::Fn` functions, which is caused by some extra parameters that are only used for the safe Rust fallback. I implemented this WIP fix which, while ugly, does seem to eliminate the slowdown.
Let me know if you think this is something that's worth pursuing 😄
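To make the "extra parameters" point concrete, here is a heavily simplified sketch of the erased wrapper's shape; `FFISafe` and the component types are stand-ins for the real ones shown in the review diff above:

```rust
use std::ffi::c_int;

// Stand-ins for rav1d's `FFISafe` wrapper and the picture/edge types.
struct FFISafe<T>(T);
struct PictureOffset;
struct CdefTop;
struct CdefBottom;

// The leading arguments mirror the asm signature; the trailing `_dst`/`_top`/
// `_bottom` pointers exist only so the safe Rust fallback can rebuild its
// typed arguments, and the asm path never reads them.
unsafe extern "C" fn cdef_filter_erased(
    dst_ptr: *mut u8,
    stride: isize,
    pri_strength: c_int,
    _dst: *const FFISafe<PictureOffset>,
    _top: *const FFISafe<CdefTop>,
    _bottom: *const FFISafe<CdefBottom>,
) {
    // A real implementation would either call the asm with the leading
    // arguments or run the Rust fallback using the trailing pointers.
    let _ = (dst_ptr, stride, pri_strength);
}
```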
Full details
Another small finding when comparing to `dav1d`: the `cdef_filter4_pri_edged_8bpc_neon` asm function seems to be ~20% slower when called from `rav1d`. Looking at the per-instruction sample count, there's one with a big, consistent diff:
dav1d:

rav1d:

From this, I tried to look at the callers. My arm assembly skill isn't good enough to analyze this properly, but comparing:
Seems to suggest that there's more stack spillover, which I think is due to the extra params in the `cdef::Fn` impl which are used for the safe Rust fallback. This in turn pushes the data the asm function uses just "far enough" away to cause this slowdown.
With the changes in this PR, we get:
and the diff in the profiler is almost gone.