
[mono] Performance regression in many places caused by #77562 #78163

Closed
radekdoulik opened this issue Nov 10, 2022 · 15 comments · Fixed by #78177

Comments

@radekdoulik
Member

We see many large regressions caused by #77562

Microbenchmarks issues:

dotnet/perf-autofiling-issues#9599
dotnet/perf-autofiling-issues#9612
dotnet/perf-autofiling-issues#9636
dotnet/perf-autofiling-issues#9637

@dotnet-issue-labeler

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

@ghost added the `untriaged` label Nov 10, 2022
@ghost

ghost commented Nov 10, 2022

Tagging subscribers to this area: @dotnet/area-system-runtime-intrinsics
See info in area-owners.md if you want to be subscribed.

Issue Details

(quotes the issue body above)

Author: radekdoulik
Assignees: -
Labels: area-System.Runtime.Intrinsics, untriaged
Milestone: -

@AaronRobinsonMSFT
Member

/cc @tannergooding

@tannergooding
Member

tannergooding commented Nov 10, 2022

@radekdoulik are any of these for the intrinsic path or are they all for the software fallback?

It looks like the latter, in which case this should be closed as "by design". Developers aren't expected to be using the software fallback and should realistically only be using these paths when IsHardwareAccelerated reports true (and therefore the underlying JIT/AOT will correctly optimize these down to the correct SIMD instructions for a given platform).

Changes to improve code maintainability and correctness for the non-accelerated path have always been accepted here.
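The guard described above can be sketched as follows. This is a Python illustration, not the actual BCL code: `is_hardware_accelerated`, `sum_vectorized`, and `sum_scalar` are hypothetical stand-ins for `Vector.IsHardwareAccelerated` and the two code paths a caller would choose between.

```python
# Capability-gated dispatch: only take the SIMD-style path when the
# runtime reports hardware acceleration; otherwise use a plain scalar loop.

is_hardware_accelerated = False  # on WASM without SIMD, this reports False

def sum_vectorized(data):
    # Placeholder for a genuinely SIMD-accelerated implementation.
    return sum(data)

def sum_scalar(data):
    total = 0
    for x in data:
        total += x
    return total

def sum_elements(data):
    if is_hardware_accelerated:
        return sum_vectorized(data)
    # A naive scalar loop is often faster than the software fallback,
    # which pays a call and a loop per vector operation.
    return sum_scalar(data)

assert sum_elements([1, 2, 3, 4]) == 10
```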

@radekdoulik
Member Author

I was able to reproduce the regression in browser-bench too and confirm that the regression is caused by #77562

The repro

Add https://gist.github.com/radekdoulik/e4d61f21d0b592c48b664a6deafb9dc8 to src\mono\sample\wasm\browser-bench and run the benchmark in the browser with the ?task=Micro URL option, for example http://localhost:62574/?task=Micro

On my local machine I measure these results.
Before regression:

| measurement | time |
| --- | --- |
| Vector, Repro 1 InequalityOperator | 0.0769us |
| Vector, Repro 2 InequalityOperator | 0.0769us |
| Vector, Repro 3 BitwiseAndOperator | 0.0644us |

With regression:

| measurement | time |
| --- | --- |
| Vector, Repro 1 InequalityOperator | 0.5401us |
| Vector, Repro 2 InequalityOperator | 0.5506us |
| Vector, Repro 3 BitwiseAndOperator | 0.0428us |

@tannergooding
Member

> I was able to reproduce the regression in browser-bench too and confirm that the regression is caused by #77562

Sounds like a case where browser/WASM doesn't have the necessary SIMD acceleration yet and so the results should be ignored for that platform until IsHardwareAccelerated starts reporting true.

@radekdoulik
Member Author

> @radekdoulik are any of these for the intrinsic path or are they all for the software fallback?
>
> It looks like the latter, in which case this should be closed as "by design". Developers aren't expected to be using the software fallback and should realistically only be using these paths when IsHardwareAccelerated reports true (and therefore the underlying JIT/AOT will correctly optimize these down to the correct SIMD instructions for a given platform).
>
> Changes to improve code maintainability and correctness for the non-accelerated path have always been accepted here.

We don't have SIMD enabled by default yet, so IsHardwareAccelerated is false.

I think the regressions here are too large and so we would like to understand what is going on. It might be a bug in the mono runtime or something else.

@tannergooding
Member

tannergooding commented Nov 10, 2022

> I think the regressions here are too large and so we would like to understand what is going on. It might be a bug in the mono runtime or something else.

In the case of things like BitwiseAnd, a correctness fix was taken to ensure that indirect invocation (reflection, delegates, etc) behaved the same as the direct invocation scenario when AVX2 was supported. Previously, operator & would only ever touch the lower 128 bits in such scenarios, but the type is resized by the VM to be 256 bits. This led to debugger operations and other scenarios being inconsistent.

Similar fixes were taken elsewhere as well. This results in more logic having to be done for the software fallback and means that a regression will be visible when IsHardwareAccelerated == false. For the SIMD accelerated case, these methods are intrinsic and so the actual IL doesn't matter. We now have more methods recognized directly as intrinsic and so less work needs to be done by the JIT to ensure good codegen. This is a net win for platforms that have the SIMD handling already and no change to platforms without (it is just adding an [Intrinsic] attribute, so no impact to the actual code).

Like I mentioned, the software fallback is not meant or expected to be used. It's there to support scenarios like the debugger and simple correctness validation on platforms without acceleration. We expose IsHardwareAccelerated and expect users to call it before utilizing the APIs in production scenarios. If it returns false, then the expectation is that the code is going to be slow, potentially much slower than a naive scalar algorithm due to needing a call and loop per operation.

Because of this, results should likely be ignored for any target where IsHardwareAccelerated == false. These will all be fixed and working as expected once Mono finishes enabling SIMD support. We can then track any perf differences between Mono and RyuJIT as fixes to the respective runtime handling of these intrinsics.
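The "call and loop per operation" cost described above can be shown with a toy sketch. This is a Python illustration with hypothetical `vec_and`/`vec_equals` helpers standing in for the fallback operators, not the actual BCL implementation.

```python
# Each fallback "vector" operator is a separate call that loops over every
# element and materializes a temporary, while the scalar version does a
# single fused pass with no temporaries.

def vec_and(a, b):
    # One call + one loop + one temporary list per '&' operation.
    return [x & y for x, y in zip(a, b)]

def vec_equals(a, b):
    # Another call + loop for '=='.
    return all(x == y for x, y in zip(a, b))

def masked_equals_fallback(a, mask, expected):
    # Two passes over the data and an intermediate allocation.
    return vec_equals(vec_and(a, mask), expected)

def masked_equals_scalar(a, mask, expected):
    # One fused pass; often faster than the fallback despite no SIMD.
    return all((x & m) == e for x, m, e in zip(a, mask, expected))

a, mask, expected = [5, 6, 7], [4, 4, 4], [4, 4, 4]
assert masked_equals_fallback(a, mask, expected) == masked_equals_scalar(a, mask, expected)
```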

@radekdoulik
Member Author

@vargaz when I check the difference in MicroTask.Repro1.RunStep I see:

```diff
(func Wasm_Browser_Bench_Sample_Sample_MicroTask_Repro1_RunStep(param i32, i32))
...
   local.tee $2
   global.set $__stack_pointer
-  i32.const 2297152
+  i32.const 2301696
   i32.load8.u
   i32.eqz
   if
-   i32.const 1376481
+   i32.const 1379105
    call mono_aot_Wasm_Browser_Bench_Sample_init_method
-   i32.const 2297152
+   i32.const 2301696
    i32.const 1
    i32.store8
...
    i32.load offset:12 align:2
    local.tee $0
-   i32.store offset:12 align:2
+   i32.store offset:28 align:2
    local.get $0
    i32.eqz
...
    br.if
    local.get $2
-   i32.const 2290552
+   i32.const 2295032
    i32.load align:2
    local.tee $0
...
    i64.load offset:8
    local.tee $3
-   i64.store offset:24 align:3
+   i64.store offset:16 align:3
    local.get $2
    local.get $0
...
    i64.load
    local.tee $4
-   i64.store offset:16 align:3
-   i32.const 2290544
+   i64.store offset:8 align:3
+   i32.const 2295024
    i32.load align:2
    local.tee $0
...
     i32.add
     i32.load8.u
-    i32.const 2290560
+    i32.const 2295040
     i32.load align:2
     local.tee $9
...
     i32.const 255
     i32.and
+    i32.eqz
     if
      local.get $0
```

I think the important difference is the added i32.eqz. I am not sure yet where it comes from, but it seems to loop to $label1 in the regressed version, while the original doesn't enter the if block and returns without looping.

I am going to look further into it; you might spot it faster, though. Also note that I am away tomorrow.

This is the complete regressed function.

```
(func Wasm_Browser_Bench_Sample_Sample_MicroTask_Repro1_RunStep(param $0 i32, $1 i32))
 local $2 i32
 local $3 i64
 local $4 i64
 local $5 i64
 local $6 i64
 local $7 i32
 local $8 i32
 local $9 i32
 global.get $__stack_pointer
 i32.const -1
 i32.add
 local.tee $2
 global.set $__stack_pointer
 i32.const 2301696
 i32.load8.u
 i32.eqz
 if
  i32.const 1379105
  call mono_aot_Wasm_Browser_Bench_Sample_init_method
  i32.const 2301696
  i32.const 1
  i32.store8

 block
  local.get $0
  i32.eqz
  br.if
  local.get $2
  local.get $0
  i32.load offset:12 align:2
  local.tee $0
  i32.store offset:28 align:2
  local.get $0
  i32.eqz
  br.if
  local.get $2
  i32.const 2295032
  i32.load align:2
  local.tee $0
  i64.load offset:8
  local.tee $3
  i64.store offset:16 align:3
  local.get $2
  local.get $0
  i64.load
  local.tee $4
  i64.store offset:8 align:3
  i32.const 2295024
  i32.load align:2
  local.tee $0
  i64.load offset:8
  local.set $5
  local.get $0
  i64.load
  local.set $6
  local.get $2
  local.get $3
  i64.store offset:56 align:3
  local.get $2
  local.get $5
  i64.store offset:40 align:3
  local.get $2
  local.get $4
  i64.store offset:48 align:3
  local.get $2
  local.get $6
  i64.store offset:32 align:3
  i32.const -1
  local.get $2
  i32.sub
  local.set $7
  i32.const -1
  local.get $2
  i32.sub
  local.set $8
  i32.const 0
  local.set $0
  loop
   local.get $0
   local.get $8
   i32.eq
   br.if
   local.get $0
   local.get $7
   i32.eq
   br.if
   local.get $2
   i32.const 32
   i32.add
   local.get $0
   i32.add
   i32.load8.u
   local.get $2
   i32.const 48
   i32.add
   local.get $0
   i32.add
   i32.load8.u
   i32.const 2295040
   i32.load align:2
   local.tee $9
   i32.load offset:4 align:2
   local.get $9
   i32.load align:2
   call.indirect (func (param i32, i32, i32) (result i32))
   i32.const 255
   i32.and
   i32.eqz
   if
    local.get $0
    i32.const 1
    i32.add
    local.tee $0
    i32.const 16
    i32.ne
    br.if


  local.get $2
  i32.const -1
  i32.sub
  global.set $__stack_pointer
  return

 call mini_llvmonly_throw_nullref_exception
 unreachable
```

@naricc
Contributor

naricc commented Nov 10, 2022

These regressions are also showing up on llvm-aot, where we should have support for intrinsics, but there may be something else going on preventing them from being used. @fanyang-mono Should intrinsics end up being used here on LLVM?

I have been trying to repro to investigate but ran into some unrelated issues building.

@radekdoulik
Member Author

> > I think the regressions here are too large and so we would like to understand what is going on. It might be a bug in the mono runtime or something else.
>
> In the case of things like BitwiseAnd, a correctness fix was taken to ensure that indirect invocation (reflection, delegates, etc) behaved the same as the direct invocation scenario when AVX2 was supported. Previously, operator & would only ever touch the lower 128 bits in such scenarios, but the type is resized by the VM to be 256 bits. This led to debugger operations and other scenarios being inconsistent.
>
> Similar fixes were taken elsewhere as well. This results in more logic having to be done for the software fallback and means that a regression will be visible when IsHardwareAccelerated == false. For the SIMD accelerated case, these methods are intrinsic and so the actual IL doesn't matter. We now have more methods recognized directly as intrinsic and so less work needs to be done by the JIT to ensure good codegen. This is a net win for platforms that have the SIMD handling already and no change to platforms without (it is just adding an [Intrinsic] attribute, so no impact to the actual code).
>
> Like I mentioned, the software fallback is not meant or expected to be used. It's there to support scenarios like the debugger and simple correctness validation on platforms without acceleration. We expose IsHardwareAccelerated and expect users to call it before utilizing the APIs in production scenarios. If it returns false, then the expectation is that the code is going to be slow, potentially much slower than a naive scalar algorithm due to needing a call and loop per operation.
>
> Because of this, results should likely be ignored for any target where IsHardwareAccelerated == false. These will all be fixed and working as expected once Mono finishes enabling SIMD support. We can then track any perf differences between Mono and RyuJIT as fixes to the respective runtime handling of these intrinsics.

That makes sense. OTOH I wonder why it is now 4-6 times slower in things like InequalityOperatorBenchmark (Value1 != Value2) for Vector<byte>, for example.

@tannergooding
Member

Perhaps Mono has acceleration for operator == but not operator != and was relying on !(left == right).

The fallback was changed to be an explicit loop to help ensure consistency and correctness across the board. It also highlights if SIMD acceleration isn't enabled for the method and therefore potential missed optimization opportunities for modern hardware.
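The `!(left == right)` rewrite mentioned above relies on a simple identity for whole-vector equality, sketched here in Python (illustrative helpers, not Mono's code): `==` is true only when all elements match, so `!=` is exactly its negation, and a runtime that accelerates `==` gets `!=` for free.

```python
# Whole-vector equality: true iff every element pair is equal.
def vec_eq(a, b):
    return all(x == y for x, y in zip(a, b))

# Inequality as the negation of equality, so an accelerated == also
# serves != without a separate accelerated implementation.
def vec_ne(a, b):
    return not vec_eq(a, b)

assert vec_ne([1, 2, 3], [1, 2, 3]) is False  # all lanes equal
assert vec_ne([1, 2, 3], [1, 9, 3]) is True   # one lane differs
```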

@fanyang-mono
Member

> These regressions are also showing up on llvm-aot, where we should have support for intrinsics, but there may be something else going on preventing them from being used. @fanyang-mono Should intrinsics end up being used here on LLVM?
>
> I have been trying to repro to investigate but ran into some unrelated issues building.

The Vector microbenchmarks were written with generics. For the llvm-aot runs, those microbenchmark methods fall back to the mini JIT. Mono's mini JIT doesn't support SIMD intrinsics yet, so they are currently using the software fallback version. That's why they are experiencing the regression as well.

@radekdoulik
Member Author

I was blind and didn't notice the != operator change.

I think it is incorrect. It should return true when it finds the first unequal element.
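A minimal sketch of the early-exit behavior described here, in Python rather than the actual C# fallback: element-wise `!=` should report inequality as soon as one element differs, without touching the remaining lanes.

```python
# Correct software fallback for vector !=: return true on the FIRST
# unequal element, false only after all elements have been compared equal.
def vec_ne(a, b):
    for x, y in zip(a, b):
        if x != y:
            return True   # early exit on the first unequal element
    return False          # every element pair was equal

assert vec_ne([1, 2, 3], [1, 9, 3]) is True   # differs in one lane only
assert vec_ne([1, 2, 3], [1, 2, 3]) is False
```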

@ghost added the `in-pr` label Nov 10, 2022
@radekdoulik
Member Author

radekdoulik commented Nov 10, 2022

With #78177, the times are back to:

| measurement | time |
| --- | --- |
| Vector, Repro 1 InequalityOperator | 0.0796us |
| Vector, Repro 2 InequalityOperator | 0.0778us |
| Vector, Repro 3 BitwiseAndOperator | 0.0433us |

vargaz added a commit to vargaz/runtime that referenced this issue Nov 10, 2022
[mono][aot] Prefer concrete instances instead of gshared ones for methods containing type equality checks.

For example, calls to Vector<T>.IsSupported cannot be optimized away in a method where T
is T_BYTE because it's written as:

```
            get => (typeof(T) == typeof(byte)) ||
                   (typeof(T) == typeof(double)) ||
```

and T_BYTE could be instantiated with an enum whose base type is byte.

Fixes some of the issues in dotnet#78163.
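A loose Python analogy for why the exact type check blocks the optimization under gshared (Python's `IntEnum` plays the role of an enum whose base type is `byte`; this is an illustration, not Mono's logic): an exact type-equality check distinguishes an enum from its underlying primitive, so code shared across "byte-like" instantiations cannot fold `typeof(T) == typeof(byte)` to a constant.

```python
from enum import IntEnum

# An enum whose members behave like the underlying primitive type...
class Flag(IntEnum):
    A = 1

assert isinstance(Flag.A, int)   # behaves like the primitive
assert type(Flag.A) is not int   # ...but exact type equality says no,
                                 # so the check cannot be folded away for
                                 # a shared "int-like" instantiation
```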
@ghost removed the `in-pr` and `untriaged` labels Nov 11, 2022
vargaz added a commit that referenced this issue Nov 11, 2022
[mono][aot] Prefer concrete instances instead of gshared ones for methods containing type equality checks (#78182)

* [mono][aot] Prefer concrete instances instead of gshared ones for methods containing type equality checks.

For example, calls to Vector<T>.IsSupported cannot be optimized away in a method where T
is T_BYTE because it's written as:

```
            get => (typeof(T) == typeof(byte)) ||
                   (typeof(T) == typeof(double)) ||
```

and T_BYTE could be instantiated with an enum whose base type is byte.

Fixes some of the issues in #78163.

* Avoid an assert when compiling Vector<object> instances.
@ghost locked as resolved and limited conversation to collaborators Dec 11, 2022
5 participants