
[mono] Performance regression in many places caused by #77562 #78163

Closed
radekdoulik opened this issue Nov 10, 2022 · 15 comments · Fixed by #78177

Comments

@radekdoulik
Member

We see many large regressions caused by #77562

Microbenchmarks issues:

dotnet/perf-autofiling-issues#9599
dotnet/perf-autofiling-issues#9612
dotnet/perf-autofiling-issues#9636
dotnet/perf-autofiling-issues#9637

@dotnet-issue-labeler

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

@ghost added the `untriaged` label Nov 10, 2022
@ghost

ghost commented Nov 10, 2022

Tagging subscribers to this area: @dotnet/area-system-runtime-intrinsics
See info in area-owners.md if you want to be subscribed.

Issue Details

(quotes the issue body above)

Author: radekdoulik
Assignees: -
Labels: area-System.Runtime.Intrinsics, untriaged
Milestone: -

@AaronRobinsonMSFT
Member

/cc @tannergooding

@tannergooding
Member

tannergooding commented Nov 10, 2022

@radekdoulik are any of these for the intrinsic path or are they all for the software fallback?

It looks like the latter, in which case this should be closed as "by design". Developers aren't expected to be using the software fallback and should realistically only be using these paths when IsHardwareAccelerated reports true (and therefore the underlying JIT/AOT will correctly optimize these down to the correct SIMD instructions for a given platform).

Changes to improve code maintainability and correctness for the non-accelerated path have always been accepted here.
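The guard described above can be sketched as follows. This is a Python illustration, not the actual BCL code: `is_hardware_accelerated`, `sum_vectorized`, and `sum_scalar` are hypothetical stand-ins for `Vector.IsHardwareAccelerated` and the two code paths a caller would choose between.

```python
# Capability-gated dispatch: only take the SIMD-style path when the
# runtime reports hardware acceleration; otherwise use a plain scalar loop.

is_hardware_accelerated = False  # on WASM without SIMD, this reports False

def sum_vectorized(data):
    # Placeholder for a genuinely SIMD-accelerated implementation.
    return sum(data)

def sum_scalar(data):
    total = 0
    for x in data:
        total += x
    return total

def sum_elements(data):
    if is_hardware_accelerated:
        return sum_vectorized(data)
    # A naive scalar loop is often faster than the software fallback,
    # which pays a call and a loop per vector operation.
    return sum_scalar(data)

assert sum_elements([1, 2, 3, 4]) == 10
```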

@radekdoulik
Member Author

I was able to reproduce the regression in browser-bench too and confirm that the regression is caused by #77562

The repro

Add https://gist.github.com/radekdoulik/e4d61f21d0b592c48b664a6deafb9dc8 to src\mono\sample\wasm\browser-bench and run the benchmark in the browser with the ?task=Micro URL option, for example http://localhost:62574/?task=Micro

On my local machine I measure these results.
Before regression:

| measurement | time |
| --- | --- |
| Vector, Repro 1 InequalityOperator | 0.0769us |
| Vector, Repro 2 InequalityOperator | 0.0769us |
| Vector, Repro 3 BitwiseAndOperator | 0.0644us |

With regression:

| measurement | time |
| --- | --- |
| Vector, Repro 1 InequalityOperator | 0.5401us |
| Vector, Repro 2 InequalityOperator | 0.5506us |
| Vector, Repro 3 BitwiseAndOperator | 0.0428us |

@tannergooding
Member

> I was able to reproduce the regression in browser-bench too and confirm that the regression is caused by #77562

Sounds like a case where browser/WASM doesn't have the necessary SIMD acceleration yet and so the results should be ignored for that platform until IsHardwareAccelerated starts reporting true.

@radekdoulik
Member Author

> @radekdoulik are any of these for the intrinsic path or are they all for the software fallback?
>
> It looks like the latter, in which case this should be closed as "by design". Developers aren't expected to be using the software fallback and should realistically only be using these paths when IsHardwareAccelerated reports true (and therefore the underlying JIT/AOT will correctly optimize these down to the correct SIMD instructions for a given platform).
>
> Changes to improve code maintainability and correctness for the non-accelerated path have always been accepted here.

We don't have SIMD enabled by default yet, so IsHardwareAccelerated is false.

I think the regressions here are too large and so we would like to understand what is going on. It might be a bug in the mono runtime or something else.

@tannergooding
Member

tannergooding commented Nov 10, 2022

> I think the regressions here are too large and so we would like to understand what is going on. It might be a bug in the mono runtime or something else.

In the case of things like BitwiseAnd, a correctness fix was taken to ensure that indirect invocation (reflection, delegates, etc) behaved the same as the direct invocation scenario when AVX2 was supported. Previously, operator & would only ever touch the lower 128 bits in such scenarios, but the type is resized by the VM to be 256 bits. This led to debugger operations and other scenarios being inconsistent.

Similar fixes were taken elsewhere as well. This results in more logic having to be done for the software fallback and means that a regression will be visible when IsHardwareAccelerated == false. For the SIMD accelerated case, these methods are intrinsic and so the actual IL doesn't matter. We now have more methods recognized directly as intrinsic and so less work needs to be done by the JIT to ensure good codegen. This is a net win for platforms that have the SIMD handling already and no change to platforms without (it is just adding an [Intrinsic] attribute, so no impact to the actual code).

Like I mentioned, the software fallback is not meant or expected to be used. It's there to support scenarios like the debugger and simple correctness validation on platforms without acceleration. We expose IsHardwareAccelerated and expect users to call it before utilizing the APIs in production scenarios. If it returns false, then the expectation is that the code is going to be slow, potentially much slower than a naive scalar algorithm due to needing a call and loop per operation.

Because of this, results should likely be ignored for any target where IsHardwareAccelerated == false. These will all be fixed and working as expected once Mono finishes enabling SIMD support. We can then track any perf differences between Mono and RyuJIT as fixes to the respective runtime handling of these intrinsics.
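The "call and loop per operation" cost described above can be shown with a toy sketch. This is a Python illustration with hypothetical `vec_and`/`vec_equals` helpers standing in for the fallback operators, not the actual BCL implementation.

```python
# Each fallback "vector" operator is a separate call that loops over every
# element and materializes a temporary, while the scalar version does a
# single fused pass with no temporaries.

def vec_and(a, b):
    # One call + one loop + one temporary list per '&' operation.
    return [x & y for x, y in zip(a, b)]

def vec_equals(a, b):
    # Another call + loop for '=='.
    return all(x == y for x, y in zip(a, b))

def masked_equals_fallback(a, mask, expected):
    # Two passes over the data and an intermediate allocation.
    return vec_equals(vec_and(a, mask), expected)

def masked_equals_scalar(a, mask, expected):
    # One fused pass; often faster than the fallback despite no SIMD.
    return all((x & m) == e for x, m, e in zip(a, mask, expected))

a, mask, expected = [5, 6, 7], [4, 4, 4], [4, 4, 4]
assert masked_equals_fallback(a, mask, expected) == masked_equals_scalar(a, mask, expected)
```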

@radekdoulik
Member Author

@vargaz when I check the difference in MicroTask.Repro1.RunStep I see:

```diff
(func Wasm_Browser_Bench_Sample_Sample_MicroTask_Repro1_RunStep(param i32, i32))
...
   local.tee $2
   global.set $__stack_pointer
-  i32.const 2297152
+  i32.const 2301696
   i32.load8.u
   i32.eqz
   if
-   i32.const 1376481
+   i32.const 1379105
    call mono_aot_Wasm_Browser_Bench_Sample_init_method
-   i32.const 2297152
+   i32.const 2301696
    i32.const 1
    i32.store8
...
    i32.load offset:12 align:2
    local.tee $0
-   i32.store offset:12 align:2
+   i32.store offset:28 align:2
    local.get $0
    i32.eqz
...
    br.if
    local.get $2
-   i32.const 2290552
+   i32.const 2295032
    i32.load align:2
    local.tee $0
...
    i64.load offset:8
    local.tee $3
-   i64.store offset:24 align:3
+   i64.store offset:16 align:3
    local.get $2
    local.get $0
...
    i64.load
    local.tee $4
-   i64.store offset:16 align:3
-   i32.const 2290544
+   i64.store offset:8 align:3
+   i32.const 2295024
    i32.load align:2
    local.tee $0
...
     i32.add
     i32.load8.u
-    i32.const 2290560
+    i32.const 2295040
     i32.load align:2
     local.tee $9
...
     i32.const 255
     i32.and
+    i32.eqz
     if
      local.get $0
```

I think the important difference is the added i32.eqz. I am not sure yet where it comes from, but it seems to loop to $label1 in the regressed version, while the original doesn't enter the if block and returns without looping.

I am going to look further into it; you might spot it faster, though. Also note that I am away tomorrow.

This is the complete regressed function.

```
(func Wasm_Browser_Bench_Sample_Sample_MicroTask_Repro1_RunStep(param $0 i32, $1 i32))
 local $2 i32
 local $3 i64
 local $4 i64
 local $5 i64
 local $6 i64
 local $7 i32
 local $8 i32
 local $9 i32
 global.get $__stack_pointer
 i32.const -1
 i32.add
 local.tee $2
 global.set $__stack_pointer
 i32.const 2301696
 i32.load8.u
 i32.eqz
 if
  i32.const 1379105
  call mono_aot_Wasm_Browser_Bench_Sample_init_method
  i32.const 2301696
  i32.const 1
  i32.store8

 block
  local.get $0
  i32.eqz
  br.if
  local.get $2
  local.get $0
  i32.load offset:12 align:2
  local.tee $0
  i32.store offset:28 align:2
  local.get $0
  i32.eqz
  br.if
  local.get $2
  i32.const 2295032
  i32.load align:2
  local.tee $0
  i64.load offset:8
  local.tee $3
  i64.store offset:16 align:3
  local.get $2
  local.get $0
  i64.load
  local.tee $4
  i64.store offset:8 align:3
  i32.const 2295024
  i32.load align:2
  local.tee $0
  i64.load offset:8
  local.set $5
  local.get $0
  i64.load
  local.set $6
  local.get $2
  local.get $3
  i64.store offset:56 align:3
  local.get $2
  local.get $5
  i64.store offset:40 align:3
  local.get $2
  local.get $4
  i64.store offset:48 align:3
  local.get $2
  local.get $6
  i64.store offset:32 align:3
  i32.const -1
  local.get $2
  i32.sub
  local.set $7
  i32.const -1
  local.get $2
  i32.sub
  local.set $8
  i32.const 0
  local.set $0
  loop
   local.get $0
   local.get $8
   i32.eq
   br.if
   local.get $0
   local.get $7
   i32.eq
   br.if
   local.get $2
   i32.const 32
   i32.add
   local.get $0
   i32.add
   i32.load8.u
   local.get $2
   i32.const 48
   i32.add
   local.get $0
   i32.add
   i32.load8.u
   i32.const 2295040
   i32.load align:2
   local.tee $9
   i32.load offset:4 align:2
   local.get $9
   i32.load align:2
   call.indirect (func (param i32, i32, i32) (result i32))
   i32.const 255
   i32.and
   i32.eqz
   if
    local.get $0
    i32.const 1
    i32.add
    local.tee $0
    i32.const 16
    i32.ne
    br.if


  local.get $2
  i32.const -1
  i32.sub
  global.set $__stack_pointer
  return

 call mini_llvmonly_throw_nullref_exception
 unreachable
```

@naricc
Contributor

naricc commented Nov 10, 2022

These regressions are also showing up on llvm-aot, where we should have support for intrinsics, but there may be something else going on preventing them from being used. @fanyang-mono Should intrinsics end up being used here on LLVM?

I have been trying to repro to investigate but ran into some unrelated issues building.

@radekdoulik
Member Author

> > I think the regressions here are too large and so we would like to understand what is going on. It might be a bug in the mono runtime or something else.
>
> In the case of things like BitwiseAnd, a correctness fix was taken to ensure that indirect invocation (reflection, delegates, etc) behaved the same as the direct invocation scenario when AVX2 was supported. Previously, operator & would only ever touch the lower 128 bits in such scenarios, but the type is resized by the VM to be 256 bits. This led to debugger operations and other scenarios being inconsistent.
>
> Similar fixes were taken elsewhere as well. This results in more logic having to be done for the software fallback and means that a regression will be visible when IsHardwareAccelerated == false. For the SIMD accelerated case, these methods are intrinsic and so the actual IL doesn't matter. We now have more methods recognized directly as intrinsic and so less work needs to be done by the JIT to ensure good codegen. This is a net win for platforms that have the SIMD handling already and no change to platforms without (it is just adding an [Intrinsic] attribute, so no impact to the actual code).
>
> Like I mentioned, the software fallback is not meant or expected to be used. It's there to support scenarios like the debugger and simple correctness validation on platforms without acceleration. We expose IsHardwareAccelerated and expect users to call it before utilizing the APIs in production scenarios. If it returns false, then the expectation is that the code is going to be slow, potentially much slower than a naive scalar algorithm due to needing a call and loop per operation.
>
> Because of this, results should likely be ignored for any target where IsHardwareAccelerated == false. These will all be fixed and working as expected once Mono finishes enabling SIMD support. We can then track any perf differences between Mono and RyuJIT as fixes to the respective runtime handling of these intrinsics.

That makes sense. OTOH I wonder why it is now 4-6 times slower in things like InequalityOperatorBenchmark (Value1 != Value2) for Vector<byte>, for example.

@tannergooding
Member

Perhaps Mono has acceleration for operator == but not operator != and was relying on !(left == right).

The fallback was changed to be an explicit loop to help ensure consistency and correctness across the board. It also highlights if SIMD acceleration isn't enabled for the method and therefore potential missed optimization opportunities for modern hardware.
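The `!(left == right)` rewrite mentioned above relies on a simple identity for whole-vector equality, sketched here in Python (illustrative helpers, not Mono's code): `==` is true only when all elements match, so `!=` is exactly its negation, and a runtime that accelerates `==` gets `!=` for free.

```python
# Whole-vector equality: true iff every element pair is equal.
def vec_eq(a, b):
    return all(x == y for x, y in zip(a, b))

# Inequality as the negation of equality, so an accelerated == also
# serves != without a separate accelerated implementation.
def vec_ne(a, b):
    return not vec_eq(a, b)

assert vec_ne([1, 2, 3], [1, 2, 3]) is False  # all lanes equal
assert vec_ne([1, 2, 3], [1, 9, 3]) is True   # one lane differs
```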

@fanyang-mono
Member

> These regressions are also showing up on llvm-aot, where we should have support for intrinsics, but there may be something else going on preventing them from being used. @fanyang-mono Should intrinsics end up being used here on LLVM?
>
> I have been trying to repro to investigate but ran into some unrelated issues building.

The Vector microbenchmarks were written with generics. For the llvm-aot runs, those microbenchmark methods fall back to the mini JIT. Mono's mini JIT doesn't support SIMD intrinsics yet, so they are currently using the software fallback version. That's why they are experiencing the regression as well.

@radekdoulik
Member Author

I was blind and didn't notice the != operator change.

I think it is incorrect. It should return true when it finds the first unequal element.
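A minimal sketch of the early-exit behavior described here, in Python rather than the actual C# fallback: element-wise `!=` should report inequality as soon as one element differs, without touching the remaining lanes.

```python
# Correct software fallback for vector !=: return true on the FIRST
# unequal element, false only after all elements have been compared equal.
def vec_ne(a, b):
    for x, y in zip(a, b):
        if x != y:
            return True   # early exit on the first unequal element
    return False          # every element pair was equal

assert vec_ne([1, 2, 3], [1, 9, 3]) is True   # differs in one lane only
assert vec_ne([1, 2, 3], [1, 2, 3]) is False
```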

@ghost added the `in-pr` label Nov 10, 2022
@radekdoulik
Member Author

radekdoulik commented Nov 10, 2022

With #78177, the times are back to:

| measurement | time |
| --- | --- |
| Vector, Repro 1 InequalityOperator | 0.0796us |
| Vector, Repro 2 InequalityOperator | 0.0778us |
| Vector, Repro 3 BitwiseAndOperator | 0.0433us |

vargaz added a commit to vargaz/runtime that referenced this issue Nov 10, 2022
[mono][aot] Prefer concrete instances instead of gshared ones for methods containing type equality checks.

For example, calls to Vector<T>.IsSupported cannot be optimized away in a method where T
is T_BYTE because it's written as:

```
            get => (typeof(T) == typeof(byte)) ||
                   (typeof(T) == typeof(double)) ||
```

and T_BYTE could be instantiated with an enum whose base type is byte.

Fixes some of the issues in dotnet#78163.
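A loose Python analogy for why the exact type check blocks the optimization under gshared (Python's `IntEnum` plays the role of an enum whose base type is `byte`; this is an illustration, not Mono's logic): an exact type-equality check distinguishes an enum from its underlying primitive, so code shared across "byte-like" instantiations cannot fold `typeof(T) == typeof(byte)` to a constant.

```python
from enum import IntEnum

# An enum whose members behave like the underlying primitive type...
class Flag(IntEnum):
    A = 1

assert isinstance(Flag.A, int)   # behaves like the primitive
assert type(Flag.A) is not int   # ...but exact type equality says no,
                                 # so the check cannot be folded away for
                                 # a shared "int-like" instantiation
```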
@ghost removed the `in-pr` and `untriaged` labels Nov 11, 2022
vargaz added a commit that referenced this issue Nov 11, 2022
[mono][aot] Prefer concrete instances instead of gshared ones for methods containing type equality checks (#78182)

* [mono][aot] Prefer concrete instances instead of gshared ones for methods containing type equality checks.

For example, calls to Vector<T>.IsSupported cannot be optimized away in a method where T
is T_BYTE because it's written as:

```
            get => (typeof(T) == typeof(byte)) ||
                   (typeof(T) == typeof(double)) ||
```

and T_BYTE could be instantiated with an enum whose base type is byte.

Fixes some of the issues in #78163.

* Avoid an assert when compiling Vector<object> instances.
@ghost locked as resolved and limited conversation to collaborators Dec 11, 2022
5 participants