[Perf] Regressions in System.Collections.TryGetValueFalse<String, String> #51258

Closed
DrewScoggins opened this issue Apr 14, 2021 · 26 comments
Labels: arch-x64, area-CodeGen-coreclr, os-linux, tenet-performance, tenet-performance-benchmarks

@DrewScoggins
Member

DrewScoggins commented Apr 14, 2021

Run Information

Architecture x64
OS ubuntu 18.04
Baseline 59c592cc8d2778bcc6173baa2b25b13190e42990
Compare 6bfc5f21dea7b550f1c807454d45408ef34764e1
Diff Diff

Regressions in System.Collections.TryGetValueFalse<String, String>

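| Benchmark | Baseline | Test | Test/Base | Baseline IR | Compare IR | IR Ratio | Baseline ETL | Compare ETL |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| IDictionary | 9.34 μs | 11.07 μs | 1.19 | | | | | |
| Dictionary | 8.17 μs | 9.92 μs | 1.21 | | | | | |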

[graph]
Historical Data in Reporting System

Repro

git clone https://github.com/dotnet/performance.git
python3 ./performance/scripts/benchmarks_ci.py -f netcoreapp5.0 --filter 'System.Collections.TryGetValueFalse<String, String>*'

Payloads

Baseline
Compare

Histogram

System.Collections.TryGetValueFalse<String, String>.IDictionary(Size: 512)


System.Collections.TryGetValueFalse<String, String>.Dictionary(Size: 512)


Docs

Profiling workflow for dotnet/runtime repository
Benchmarking workflow for dotnet/runtime repository

category:performance
theme:benchmarks

@DrewScoggins DrewScoggins added os-linux Linux OS (any supported distro) tenet-performance Performance related issue tenet-performance-benchmarks Issue from performance benchmark arch-x64 labels Apr 14, 2021
@dotnet-issue-labeler dotnet-issue-labeler bot added area-System.Collections untriaged New issue has not been triaged by the area owner labels Apr 14, 2021
@ghost

ghost commented Apr 14, 2021

Tagging subscribers to this area: @eiriktsarpalis
See info in area-owners.md if you want to be subscribed.

Issue Details

Author: DrewScoggins
Assignees: AndyAyersMS
Labels:

arch-x64, area-System.Collections, os-linux, tenet-performance, tenet-performance-benchmarks, untriaged

Milestone: -

@DrewScoggins
Member Author

Run Information

Architecture x64
OS ubuntu 18.04
Baseline 59c592cc8d2778bcc6173baa2b25b13190e42990
Compare 6bfc5f21dea7b550f1c807454d45408ef34764e1
Diff Diff

Improvements in System.Collections.ContainsKeyFalse<String, String>

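| Benchmark | Baseline | Test | Test/Base | Baseline IR | Compare IR | IR Ratio | Baseline ETL | Compare ETL |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| IDictionary | 11.48 μs | 9.45 μs | 0.82 | | | | | |
| Dictionary | 9.95 μs | 8.25 μs | 0.83 | | | | | |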

[graph]
Historical Data in Reporting System

Repro

git clone https://github.com/dotnet/performance.git
python3 ./performance/scripts/benchmarks_ci.py -f netcoreapp5.0 --filter 'System.Collections.ContainsKeyFalse<String, String>*'

Payloads

Baseline
Compare

Histogram

System.Collections.ContainsKeyFalse<String, String>.IDictionary(Size: 512)


System.Collections.ContainsKeyFalse<String, String>.Dictionary(Size: 512)


Docs

Profiling workflow for dotnet/runtime repository
Benchmarking workflow for dotnet/runtime repository

@DrewScoggins
Member Author

Looking at the full test trends for TryGetValueFalse and ContainsKeyFalse, it seems that when one gets faster the other gets slower.

[graphs]

@AndyAyersMS
Member

@DrewScoggins the zip archives linked above lose all file permissions (or else I'm doing it wrong). Makes it painful to download and then run the binaries.

andy@andy-ubuntu$ unzip -Zl 6d58d995-201b-444b-a499-a528190087f3.zip Core_Root/corerun 
?---------  2.0 unx   108912 b-    39058 defN 21-Apr-05 14:49 Core_Root/corerun

after an unzip:

---------- 1 andy andy   108912 Apr  5 14:49 corerun

Can we instead get a zipped up tar archive?

@AndyAyersMS
Member

FWIW I can't repro the above regression with local builds on my unix box, which is why I wanted to grab the exact bits used by the runs.

@DrewScoggins
Member Author

Those are the zips that we send to Helix, and we don't really have another place where we can easily get them. If I remember correctly, the only thing you had to chmod +x was corerun, and then everything worked. Maybe we can look at including a little shell script that sets up the binaries for repro?

@AndyAyersMS
Member

If this is what you send, I wonder what helix does to work around this?

Let me see if fixing corerun to be +x and all the others +r does the trick.
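
A minimal sketch of that permission fix, assuming the archive has already been extracted and that corerun is the only file that needs the execute bit. The helper name `restore_permissions` and its `executables` parameter are hypothetical and are not part of the dotnet/performance scripts.

```python
import os
import stat
from pathlib import Path


# Hypothetical helper: the name and the default list of executables are
# assumptions for illustration only.
def restore_permissions(root, executables=("corerun",)):
    """Restore usable modes under `root` after an unzip that dropped them.

    Directories and files named in `executables` get 0o755; every other
    file gets 0o644. Returns the number of paths that were changed.
    """
    root = Path(root)
    os.chmod(root, 0o755)
    changed = 1
    # Top-down walk: each subdirectory is made traversable before
    # os.walk descends into it.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames:
            os.chmod(Path(dirpath) / name, 0o755)
            changed += 1
        for name in filenames:
            mode = 0o755 if name in executables else 0o644
            os.chmod(Path(dirpath) / name, mode)
            changed += 1
    return changed
```

It would be called on the extracted payload directory, for example `restore_permissions("Core_Root")`. Other files in the payload may need the execute bit as well, so the `executables` tuple may have to grow.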

@AndyAyersMS
Member

It requires more futzing about than just that, not quite sure what BDN is doing... at any rate I have the same non-result with the downloaded builds as I did with the local builds. Could be my ancient HW I suppose.

TryGetValueFalse

Base, Download

BenchmarkDotNet=v0.12.1.1521-nightly, OS=ubuntu 18.04
Intel Core i7-2720QM CPU 2.20GHz (Sandy Bridge), 1 CPU, 8 logical and 4 physical cores
.NET SDK=6.0.100-preview.3.21202.5
  [Host]     : .NET 6.0.0 (6.0.21.20104), X64 RyuJIT
  Job-CXOBIO : .NET 6.0.0 (6.0.21.20502), X64 RyuJIT

PowerPlanMode=00000000-0000-0000-0000-000000000000  Arguments=/p:DebugType=portable  Toolchain=CoreRun  
IterationTime=250.0000 ms  MaxIterationCount=20  MinIterationCount=15  
WarmupCount=1  
| Method | Size | Mean | Error | StdDev | Median | Min | Max |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Dictionary | 512 | 14.85 µs | 0.060 µs | 0.047 µs | 14.84 µs | 14.79 µs | 14.93 µs |
| IDictionary | 512 | 16.57 µs | 0.105 µs | 0.093 µs | 16.58 µs | 16.41 µs | 16.71 µs |
| SortedList | 512 | 420.46 µs | 1.288 µs | 1.142 µs | 420.31 µs | 418.88 µs | 422.94 µs |
| SortedDictionary | 512 | 460.61 µs | 17.330 µs | 19.957 µs | 447.18 µs | 443.50 µs | 487.07 µs |
| ConcurrentDictionary | 512 | 22.84 µs | 0.076 µs | 0.068 µs | 22.81 µs | 22.77 µs | 23.00 µs |
| ImmutableDictionary | 512 | 37.58 µs | 0.208 µs | 0.184 µs | 37.57 µs | 37.33 µs | 37.91 µs |
| ImmutableSortedDictionary | 512 | 422.83 µs | 1.118 µs | 0.933 µs | 422.94 µs | 421.12 µs | 424.30 µs |

Base, Local

BenchmarkDotNet=v0.12.1.1521-nightly, OS=ubuntu 18.04
Intel Core i7-2720QM CPU 2.20GHz (Sandy Bridge), 1 CPU, 8 logical and 4 physical cores
.NET SDK=6.0.100-preview.3.21202.5
  [Host]     : .NET 6.0.0 (6.0.21.20104), X64 RyuJIT
  Job-NDBUIM : .NET 6.0.0 (42.42.42.42424), X64 RyuJIT

PowerPlanMode=00000000-0000-0000-0000-000000000000  Arguments=/p:DebugType=portable  Toolchain=CoreRun  
IterationTime=250.0000 ms  MaxIterationCount=20  MinIterationCount=15  
WarmupCount=1  
| Method | Size | Mean | Error | StdDev | Median | Min | Max |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Dictionary | 512 | 14.88 µs | 0.070 µs | 0.062 µs | 14.86 µs | 14.80 µs | 15.01 µs |
| IDictionary | 512 | 16.67 µs | 0.202 µs | 0.169 µs | 16.61 µs | 16.53 µs | 17.13 µs |
| SortedList | 512 | 417.57 µs | 1.038 µs | 0.867 µs | 417.49 µs | 416.20 µs | 418.93 µs |
| SortedDictionary | 512 | 438.12 µs | 3.399 µs | 3.013 µs | 438.20 µs | 433.96 µs | 443.15 µs |
| ConcurrentDictionary | 512 | 22.92 µs | 0.053 µs | 0.044 µs | 22.90 µs | 22.86 µs | 22.99 µs |
| ImmutableDictionary | 512 | 36.94 µs | 0.134 µs | 0.126 µs | 36.92 µs | 36.75 µs | 37.18 µs |
| ImmutableSortedDictionary | 512 | 422.31 µs | 0.916 µs | 0.765 µs | 422.20 µs | 420.92 µs | 424.09 µs |

Diff, Download

BenchmarkDotNet=v0.12.1.1521-nightly, OS=ubuntu 18.04
Intel Core i7-2720QM CPU 2.20GHz (Sandy Bridge), 1 CPU, 8 logical and 4 physical cores
.NET SDK=6.0.100-preview.3.21202.5
  [Host]     : .NET 6.0.0 (6.0.21.20104), X64 RyuJIT
  Job-PLJKLO : .NET 6.0.0 (6.0.21.20602), X64 RyuJIT

PowerPlanMode=00000000-0000-0000-0000-000000000000  Arguments=/p:DebugType=portable  Toolchain=CoreRun  
IterationTime=250.0000 ms  MaxIterationCount=20  MinIterationCount=15  
WarmupCount=1  
| Method | Size | Mean | Error | StdDev | Median | Min | Max |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Dictionary | 512 | 14.94 µs | 0.108 µs | 0.101 µs | 14.92 µs | 14.82 µs | 15.18 µs |
| IDictionary | 512 | 16.73 µs | 0.142 µs | 0.126 µs | 16.78 µs | 16.49 µs | 16.90 µs |
| SortedList | 512 | 434.83 µs | 1.179 µs | 0.985 µs | 434.84 µs | 433.51 µs | 436.95 µs |
| SortedDictionary | 512 | 468.35 µs | 1.755 µs | 1.465 µs | 468.34 µs | 466.65 µs | 471.78 µs |
| ConcurrentDictionary | 512 | 23.12 µs | 0.299 µs | 0.280 µs | 22.98 µs | 22.84 µs | 23.75 µs |
| ImmutableDictionary | 512 | 37.81 µs | 0.702 µs | 0.548 µs | 37.64 µs | 37.23 µs | 39.04 µs |
| ImmutableSortedDictionary | 512 | 445.27 µs | 5.198 µs | 4.608 µs | 442.97 µs | 441.13 µs | 454.60 µs |

Diff, Local

BenchmarkDotNet=v0.12.1.1521-nightly, OS=ubuntu 18.04
Intel Core i7-2720QM CPU 2.20GHz (Sandy Bridge), 1 CPU, 8 logical and 4 physical cores
.NET SDK=6.0.100-preview.3.21202.5
  [Host]     : .NET 6.0.0 (6.0.21.20104), X64 RyuJIT
  Job-KXUFTT : .NET 6.0.0 (42.42.42.42424), X64 RyuJIT

PowerPlanMode=00000000-0000-0000-0000-000000000000  Arguments=/p:DebugType=portable  Toolchain=CoreRun  
IterationTime=250.0000 ms  MaxIterationCount=20  MinIterationCount=15  
WarmupCount=1  
| Method | Size | Mean | Error | StdDev | Median | Min | Max |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Dictionary | 512 | 14.74 µs | 0.070 µs | 0.062 µs | 14.73 µs | 14.64 µs | 14.86 µs |
| IDictionary | 512 | 16.36 µs | 0.080 µs | 0.075 µs | 16.37 µs | 16.27 µs | 16.50 µs |
| SortedList | 512 | 408.75 µs | 3.088 µs | 2.411 µs | 407.43 µs | 406.70 µs | 413.31 µs |
| SortedDictionary | 512 | 448.91 µs | 1.650 µs | 1.462 µs | 449.24 µs | 446.71 µs | 451.09 µs |
| ConcurrentDictionary | 512 | 22.67 µs | 0.057 µs | 0.048 µs | 22.68 µs | 22.58 µs | 22.75 µs |
| ImmutableDictionary | 512 | 37.48 µs | 0.114 µs | 0.101 µs | 37.46 µs | 37.31 µs | 37.68 µs |
| ImmutableSortedDictionary | 512 | 433.14 µs | 3.722 µs | 2.906 µs | 433.14 µs | 429.73 µs | 436.68 µs |

@AndyAyersMS
Member

Also note the same tests on windows x64 sped up at that same point:

Windows

[graph]

Ubuntu

[graph]

(likewise for IDictionary)

I don't suppose we have ETL/IR data for those windows runs...?

@DrewScoggins
Member Author

We do not. ETL collection has been really dodgy on lab machines as of late, with IR collection even worse. You could use the two Windows builds below to collect traces locally, though.

Baseline
Compare

@AndyAyersMS AndyAyersMS added this to the 6.0.0 milestone Apr 17, 2021
@AndyAyersMS
Member

AndyAyersMS commented Jun 11, 2021

Taking a fresh look, here's the recent history. We see 20% or so swings in perf. Also lately we seem to have reached a new low, but given history it's not clear how durable that is going to be.

Working back in time, these jumps all seem to be correlated with PGO updates:

[graph]

We should be able to use profiling to focus in on the key methods, and then check the optimization data we've gathered to see if it is fluctuating.

Also worth noting: the Dictionary case (which tests the exact same code) shows the same pattern of fluctuation.

[graph]

@AndyAyersMS
Member

Profiling the test above (and filtering to just the "actual" timed runs done by BDN):
[image]

Looking at codegen, the method most likely impacted by PGO is FindValue; in my local run it has a couple of GDV (guarded devirtualization) sites. GetNonRandomizedHashCode does some block reordering with PGO, but the impact of that looks like it should be small. So we'll focus on the PGO data for FindValue.

@AndyAyersMS
Member

With current profile data, here are the PGO counts.

-----------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd                 weight      IBC  lp [IL range]     [jump]      [EH region]         [flags]
-----------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             798k 797559    [000..008)-> BB03 ( cond )                     IBC 
BB02 [0001]  1                             0         0    [008..00E)                                     rare IBC 
BB03 [0002]  2                             798k 797559    [00E..01F)-> BB26 ( cond )                     IBC 
BB04 [0003]  1                             764k 763664    [01F..02C)-> BB18 ( cond )                     IBC 
BB05 [0004]  1                             158k 157883    [02C..060)-> BB12 ( cond )                     IBC 
BB06 [0005]  1                             0         0    [060..066)                                     rare IBC 
BB07 [0006]  2                             0         0    [066..071)-> BB26 ( cond )                     rare bwd bwd-target IBC 
BB08 [0007]  1                             0         0    [071..084)-> BB10 ( cond )                     rare bwd IBC 
BB09 [0008]  1                             0         0    [084..09A)-> BB24 ( cond )                     rare bwd IBC 
BB10 [0009]  2                             0         0    [09A..0B0)-> BB07 ( cond )                     rare bwd IBC 
BB11 [0010]  1                             0         0    [0B0..0B5)-> BB23 (always)                     rare IBC 
BB12 [0011]  1                             158k 157883    [0B5..0C2)                                     IBC 
BB13 [0012]  2                             176k 176445    [0C2..0CD)-> BB26 ( cond )                     bwd bwd-target IBC 
BB14 [0013]  1                             168k 167622    [0CD..0E0)-> BB16 ( cond )                     bwd IBC 
BB15 [0014]  1                             149k 149060    [0E0..0F3)-> BB24 ( cond )                     bwd IBC 
BB16 [0015]  2                           18562.  18562    [0F3..109)-> BB13 ( cond )                     bwd IBC 
BB17 [0016]  1                             0         0    [109..10B)-> BB23 (always)                     rare IBC 
BB18 [0017]  1                             606k 605781    [10B..130)                                     IBC 
BB19 [0018]  2                             874k 874419    [130..138)-> BB26 ( cond )                     bwd bwd-target IBC 
BB20 [0019]  1                             523k 522763    [138..14C)-> BB22 ( cond )                     bwd IBC 
BB21 [0020]  1                             254k 254125    [14C..15B)-> BB24 ( cond )                     bwd IBC 
BB22 [0021]  2                             269k 268638    [15B..171)-> BB19 ( cond )                     bwd IBC 
BB23 [0022]  3                             0         0    [171..176)                                     rare IBC 
BB24 [0023]  4                             403k 403185    [176..17D)                                     IBC 
BB25 [0024]  2                             798k 797559    [17D..17F)        (return)                     bwd-target IBC 
BB26 [0025]  4                             394k 394374    [17F..187)-> BB25 (always)                     bwd IBC 
-----------------------------------------------------------------------------------------------------------------------------------------

The GDV sites in FindValue are both kind of marginal:

impImportBlockPending for BB24

    [ 1]  46 (0x02e) constrained. (1B000018) callvirt 06000505
In Compiler::impImportCall: opcode is callvirt, kind=4, callRetType is int, structSize is 0

impDevirtualizeCall: Trying to devirtualize virtual call:
    class for 'this' is __Canon (attrib 20020000)
    base method is Object::GetHashCode
--- no derived method: object class was canonical
    Class not final or exact, and method not final
Considering guarded devirtualization at IL offset 52 (0x34)
Likely class for 00007FFED5605540 (__Canon) is 00007FFED56090B8 (RuntimeType) [likelihood:37 classes seen:7]
virtual call would invoke method GetHashCode
Marking call [000234] as guarded devirtualization candidate; will guess for class RuntimeType

Importing BB15 (PC=224) of 'Dictionary`2:FindValue(__Canon):byref:this'
    [ 0] 224 (0x0e0) ldloc.s 7
    [ 1] 226 (0x0e2) ldloc.0
    [ 2] 227 (0x0e3) ldfld 0A000C61
    [ 2] 232 (0x0e8) ldarg.1
    [ 3] 233 (0x0e9) callvirt 0A00048B
In Compiler::impImportCall: opcode is callvirt, kind=4, callRetType is bool, structSize is 0

impDevirtualizeCall: Trying to devirtualize virtual call:
    class for 'this' is EqualityComparer`1 (attrib 20020400)
    base method is EqualityComparer`1::Equals
    devirt to EqualityComparer`1::Equals -- inexact or not final
               [000340] --CXG-------              *  CALLV vt-ind int    EqualityComparer`1.Equals
               [000336] ------------ this in rcx  +--*  LCL_VAR   ref    V09 loc7         
               [000338] ---XG------- arg1         +--*  FIELD     ref    key
               [000337] ------------              |  \--*  LCL_VAR   byref  V02 loc0         
               [000339] ------------ arg2         \--*  LCL_VAR   ref    V01 arg1         
    Class not final or exact, and method not final
Considering guarded devirtualization at IL offset 233 (0xe9)
Likely class for 00007FFED56B6458 (EqualityComparer`1) is 00007FFED587BE58 (ObjectEqualityComparer`1) [likelihood:40 classes seen:7]
virtual call would invoke method Equals
Marking call [000340] as guarded devirtualization candidate; will guess for class ObjectEqualityComparer`1

So we should look into the stability of these class profiles over time.
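
To illustrate why likelihoods of 37 and 40 count as marginal, here is a small sketch of a likelihood-based guess decision. The class counts and the 30% cutoff are made up for illustration; they are not the JIT's real histogram or threshold, and the function names are hypothetical.

```python
# Hypothetical cutoff for illustration; not a value taken from the JIT.
GUESS_THRESHOLD_PERCENT = 30


def likely_class(histogram):
    """Return (class_name, likelihood_percent) for the most frequent class."""
    total = sum(histogram.values())
    if total == 0:
        return None, 0
    name, count = max(histogram.items(), key=lambda item: item[1])
    return name, (100 * count) // total


def should_guess(histogram, threshold=GUESS_THRESHOLD_PERCENT):
    """Decide whether the dominant class is common enough to guess for."""
    name, likelihood = likely_class(histogram)
    return name is not None and likelihood >= threshold


# Made-up counts: seven classes seen, the top one at 37%.
run_a = {"RuntimeType": 37, "String": 20, "A": 15, "B": 10, "C": 8, "D": 6, "E": 4}
# Same call site with a slightly different made-up sample: the top class
# changes and its share falls below the cutoff.
run_b = {"RuntimeType": 27, "String": 29, "A": 16, "B": 10, "C": 8, "D": 6, "E": 4}

print(likely_class(run_a), should_guess(run_a))
print(likely_class(run_b), should_guess(run_b))
```

With a dominant class this close to the cutoff, a small shift in the collected profile can change both the guessed class and whether a guess is made at all, which in turn changes the generated code.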

@AndyAyersMS
Member

Comparing the codegen for FindValue at ccec848 (just before the June 7 update) to bbf9659 (a few days later), the only diff comes from a diff in the edge count info.

For the older data, we have

Profile summary: 8 runs, 0 block probes, 17 edge probes, 2 class profiles, 0 other records

Reconstructing block counts from sparse edge instrumentation
... adding known edge BB02 -> BB03: weight 0
... adding known edge BB07 -> BB26: weight 0
... adding known edge BB09 -> BB10: weight 0
... adding known edge BB09 -> BB24: weight 0
... adding known edge BB10 -> BB07: weight 0
... adding known edge BB13 -> BB26: weight 8927
... adding known edge BB15 -> BB16: weight 0
... adding known edge BB15 -> BB24: weight 150363
... adding known edge BB16 -> BB13: weight 39437
... adding known edge BB17 -> BB23: weight 0
... adding known edge BB19 -> BB26: weight 351612
... adding known edge BB21 -> BB22: weight 0
... adding known edge BB21 -> BB24: weight 253618
... adding known edge BB22 -> BB19: weight 268579
... adding known edge BB22 -> BB23: weight 0
... adding known edge BB24 -> BB25: weight 403980
... adding known edge BB25 -> BB01: weight 798427

and for the newer

Reconstructing block counts from sparse edge instrumentation
... adding known edge BB02 -> BB03: weight 0
... adding known edge BB07 -> BB26: weight 0
... adding known edge BB09 -> BB10: weight 0
... adding known edge BB09 -> BB24: weight 0
... adding known edge BB10 -> BB07: weight 0
... adding known edge BB13 -> BB26: weight 8823
... adding known edge BB15 -> BB16: weight 0
... adding known edge BB15 -> BB24: weight 149060
... adding known edge BB16 -> BB13: weight 18562
... adding known edge BB17 -> BB23: weight 0
... adding known edge BB19 -> BB26: weight 351656
... adding known edge BB21 -> BB22: weight 0
... adding known edge BB21 -> BB24: weight 254125
... adding known edge BB22 -> BB19: weight 268638
... adding known edge BB22 -> BB23: weight 0
... adding known edge BB24 -> BB25: weight 403185
... adding known edge BB25 -> BB01: weight 797559

They largely agree, save for the BB16 -> BB13 edge, which has a much lower count in the second collection.
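
The comparison can be reproduced mechanically from the two edge lists above. This is only a sketch: the 10% relative tolerance and the function name `unstable_edges` are arbitrary choices for illustration.

```python
# Edge weights copied from the two "adding known edge" dumps above.
OLD_EDGES = {
    "BB02->BB03": 0, "BB07->BB26": 0, "BB09->BB10": 0, "BB09->BB24": 0,
    "BB10->BB07": 0, "BB13->BB26": 8927, "BB15->BB16": 0,
    "BB15->BB24": 150363, "BB16->BB13": 39437, "BB17->BB23": 0,
    "BB19->BB26": 351612, "BB21->BB22": 0, "BB21->BB24": 253618,
    "BB22->BB19": 268579, "BB22->BB23": 0, "BB24->BB25": 403980,
    "BB25->BB01": 798427,
}
NEW_EDGES = {
    "BB02->BB03": 0, "BB07->BB26": 0, "BB09->BB10": 0, "BB09->BB24": 0,
    "BB10->BB07": 0, "BB13->BB26": 8823, "BB15->BB16": 0,
    "BB15->BB24": 149060, "BB16->BB13": 18562, "BB17->BB23": 0,
    "BB19->BB26": 351656, "BB21->BB22": 0, "BB21->BB24": 254125,
    "BB22->BB19": 268638, "BB22->BB23": 0, "BB24->BB25": 403185,
    "BB25->BB01": 797559,
}


def unstable_edges(old, new, rel_tol=0.10):
    """Return (edge, old_weight, new_weight) for edges whose weight changed
    by more than rel_tol relative to the larger of the two weights."""
    flagged = []
    for edge, old_weight in old.items():
        new_weight = new[edge]
        largest = max(old_weight, new_weight)
        if largest == 0:
            continue  # never executed in either profile
        if abs(new_weight - old_weight) / largest > rel_tol:
            flagged.append((edge, old_weight, new_weight))
    return flagged


print(unstable_edges(OLD_EDGES, NEW_EDGES))
```

Only the BB16 -> BB13 edge is flagged; every other nonzero edge differs by under 2%.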

After solving, the older profile ends up with a higher weight for BB13, BB14, and BB16 (old profile below; the new profile is in the comment above).

-----------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd                 weight      IBC  lp [IL range]     [jump]      [EH region]         [flags]
-----------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             798k 798427    [000..008)-> BB03 ( cond )                     IBC 
BB02 [0001]  1                             0         0    [008..00E)                                     rare IBC 
BB03 [0002]  2                             798k 798427    [00E..01F)-> BB26 ( cond )                     IBC 
BB04 [0003]  1                             765k 764520    [01F..02C)-> BB18 ( cond )                     IBC 
BB05 [0004]  1                             159k 159290    [02C..060)-> BB12 ( cond )                     IBC 
BB06 [0005]  1                             0         0    [060..066)                                     rare IBC 
BB07 [0006]  2                             0         0    [066..071)-> BB26 ( cond )                     rare bwd bwd-target IBC 
BB08 [0007]  1                             0         0    [071..084)-> BB10 ( cond )                     rare bwd IBC 
BB09 [0008]  1                             0         0    [084..09A)-> BB24 ( cond )                     rare bwd IBC 
BB10 [0009]  2                             0         0    [09A..0B0)-> BB07 ( cond )                     rare bwd IBC 
BB11 [0010]  1                             0         0    [0B0..0B5)-> BB23 (always)                     rare IBC 
BB12 [0011]  1                             159k 159290    [0B5..0C2)                                     IBC 
BB13 [0012]  2                             199k 198727    [0C2..0CD)-> BB26 ( cond )                     bwd bwd-target IBC 
BB14 [0013]  1                             190k 189800    [0CD..0E0)-> BB16 ( cond )                     bwd IBC 
BB15 [0014]  1                             150k 150363    [0E0..0F3)-> BB24 ( cond )                     bwd IBC 
BB16 [0015]  2                           39437.  39437    [0F3..109)-> BB13 ( cond )                     bwd IBC 
BB17 [0016]  1                             0         0    [109..10B)-> BB23 (always)                     rare IBC 
BB18 [0017]  1                             605k 605230    [10B..130)                                     IBC 
BB19 [0018]  2                             874k 873809    [130..138)-> BB26 ( cond )                     bwd bwd-target IBC 
BB20 [0019]  1                             522k 522197    [138..14C)-> BB22 ( cond )                     bwd IBC 
BB21 [0020]  1                             254k 253618    [14C..15B)-> BB24 ( cond )                     bwd IBC 
BB22 [0021]  2                             269k 268579    [15B..171)-> BB19 ( cond )                     bwd IBC 
BB23 [0022]  3                             0         0    [171..176)                                     rare IBC 
BB24 [0023]  4                             404k 403980    [176..17D)                                     IBC 
BB25 [0024]  2                             798k 798427    [17D..17F)        (return)                     bwd-target IBC 
BB26 [0025]  4                             394k 394447    [17F..187)-> BB25 (always)                     bwd IBC 
-----------------------------------------------------------------------------------------------------------------------------------------

The impact of this on codegen is fairly minimal; we end up reordering some blocks at the end of the method.
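
The BB13, BB14, and BB16 weights can be checked by hand with simple flow conservation. This sketch is not the JIT's solver: it takes the BB12 weight from the block tables as given, and it assumes (from reading the dumps) that BB12 falls through into BB13, BB13 into BB14, BB14 into BB15, and BB15 into BB16.

```python
def solve_loop_blocks(bb12, e13_26, e15_16, e15_24, e16_13):
    """Derive block weights for BB13..BB16 from known edge weights.

    Assumed control flow (an interpretation of the dump, not JIT output):
      BB13 is entered from BB12 (fall through) and from BB16 (back edge),
      BB14 gets whatever does not leave BB13 for BB26,
      BB15 has both outgoing edges instrumented,
      BB16 is entered from BB14 (branch) and from BB15 (fall through).
    """
    bb13 = bb12 + e16_13
    bb14 = bb13 - e13_26
    bb15 = e15_16 + e15_24
    bb16 = (bb14 - bb15) + e15_16
    return {"BB13": bb13, "BB14": bb14, "BB15": bb15, "BB16": bb16}


# Older profile: BB12 weight and edge weights from the dumps above.
old = solve_loop_blocks(bb12=159290, e13_26=8927, e15_16=0,
                        e15_24=150363, e16_13=39437)
# Newer profile.
new = solve_loop_blocks(bb12=157883, e13_26=8823, e15_16=0,
                        e15_24=149060, e16_13=18562)

print(old)
print(new)
```

Both results match the IBC column of the corresponding block tables, and the whole difference traces back to the BB16 -> BB13 edge weight.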

@AndyAyersMS
Member

Looking back to the May 21 codegen shows similar diffs: the edge weights differ, which leads mainly to different block layouts but otherwise "equivalent" code. Some of the layout diffs also come from shifting likelihoods in guarded devirtualization.

@AndyAyersMS
Member

On my box I'm able to repro about a 4% regression with the June 2 build, vs May 21 and June 7:

BenchmarkDotNet=v0.13.0.1555-nightly, OS=Windows 10.0.19043.1052 (21H1/May2021Update)
Intel Core i7-8700 CPU 3.20GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET SDK=6.0.100-preview.6.21275.3
  [Host]     : .NET 6.0.0 (6.0.21.27401), X64 RyuJIT
  Job-ERFQEC : .NET 6.0.0 (42.42.42.42424), X64 RyuJIT
  Job-TXTABM : .NET 6.0.0 (42.42.42.42424), X64 RyuJIT
  Job-HUUBVC : .NET 6.0.0 (42.42.42.42424), X64 RyuJIT

PowerPlanMode=00000000-0000-0000-0000-000000000000  Arguments=/p:DebugType=portable  IterationTime=250.0000 ms
MaxIterationCount=20  MinIterationCount=15  WarmupCount=1
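| Method | Job | Toolchain | Size | Mean | Error | StdDev | Median | Min | Max | Ratio | Gen 0 | Gen 1 | Gen 2 | Allocated |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| IDictionary | Job-ERFQEC | June 7 | 512 | 11.28 us | 0.158 us | 0.148 us | 11.27 us | 11.00 us | 11.59 us | 0.99 | - | - | - | - |
| IDictionary | Job-TXTABM | June 2 | 512 | 11.70 us | 0.179 us | 0.167 us | 11.72 us | 11.42 us | 11.90 us | 1.03 | - | - | - | - |
| IDictionary | Job-HUUBVC | May 21 | 512 | 11.39 us | 0.101 us | 0.094 us | 11.39 us | 11.22 us | 11.53 us | 1.00 | - | - | - | - |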
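
As a quick arithmetic check, the Ratio column is consistent with each mean divided by the May 21 mean, and the "about 4%" figure with June 2 measured against June 7. The helper names below are illustrative only.

```python
# Mean times in microseconds, copied from the table above.
MEANS_US = {"June 7": 11.28, "June 2": 11.70, "May 21": 11.39}


def ratios(means, baseline):
    """Mean of each build divided by the baseline build's mean."""
    base = means[baseline]
    return {name: round(mean / base, 2) for name, mean in means.items()}


def percent_slower(means, build, reference):
    """How much slower `build` is than `reference`, in percent."""
    return 100.0 * (means[build] / means[reference] - 1.0)


print(ratios(MEANS_US, "May 21"))
print(round(percent_slower(MEANS_US, "June 2", "June 7"), 1))
print(round(percent_slower(MEANS_US, "June 2", "May 21"), 1))
```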

None of the other key methods have diffs, so evidently the block layout changes in FindValue(__Canon) must be the cause of these perf swings.

Seems like if we fixed the class profile merge logic (see #48549 (comment)) that might lead to somewhat more stable profiles.

Not sure what leads to the remaining count instability. It could perhaps be lost counter updates from concurrency, but I would not expect that to cause such large swings.

@jeffschwMSFT jeffschwMSFT removed the untriaged New issue has not been triaged by the area owner label Jul 9, 2021
@danmoseley danmoseley added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI and removed area-System.Collections labels Jul 14, 2021
@danmoseley
Member

Updating area based on analysis above.

@AndyAyersMS
Member

These tests are quite sensitive to fine details of PGO, and perf swings back and forth depending on the exact block layout.

Down the road we can likely improve the block layout algorithms to avoid being quite so sensitive. But I don't think we can address this for 6.0, so I am going to move this to Future.

@AndyAyersMS AndyAyersMS modified the milestones: 6.0.0, Future Jul 27, 2021
@danmoseley
Member

Are there any characteristics of regressions caused by "fine details of PGO", changing PGO data, etc. that we can spot in the graphs for other regressions? It feels like when I suspect PGO data I'm just waving my hands. E.g., would it look like bimodal between builds, but stable within iterations on the same build?

@AndyAyersMS
Member

> would it look like bimodal between builds, but stable within iterations on the same build?

Yes, this (more or less): if you look up at #51258 (comment) you can see that the perf of the test switches between two levels over time, and the timing of the swings is correlated with PGO updates.

There are other factors that can cause this sort of behavior (memory alignment of data, etc.), so checking for the correlation with PGO updates is important.
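
A rough sketch of that correlation check: find builds where the mean jumps to a different level, then see which of those jumps coincide with a PGO data update. The build ids, timings, and tolerance are made up for illustration, and the function names are hypothetical.

```python
def level_shifts(history, rel_tol=0.10):
    """Return build ids where the mean changed by more than rel_tol
    relative to the previous build. `history` is an ordered list of
    (build_id, mean) pairs."""
    shifts = []
    for (_, prev), (build, cur) in zip(history, history[1:]):
        if abs(cur - prev) / prev > rel_tol:
            shifts.append(build)
    return shifts


def shifts_matching_updates(shifts, pgo_update_builds):
    """Subset of shifts that landed on a build containing a PGO update."""
    return [build for build in shifts if build in pgo_update_builds]


# Made-up per-build means (microseconds) switching between two levels.
history = [("b1", 8.2), ("b2", 8.1), ("b3", 9.9),
           ("b4", 10.0), ("b5", 8.2), ("b6", 8.3)]
# Made-up set of builds that picked up new PGO data.
pgo_update_builds = {"b3", "b5"}

shifts = level_shifts(history)
print(shifts, shifts_matching_updates(shifts, pgo_update_builds))
```

If most level shifts line up with PGO updates, PGO sensitivity is a plausible explanation; if they do not, other causes such as data alignment are worth checking first.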

@danmoseley
Member

Is there a file in the repo that contains the PGO update, so that I can correlate its history (i.e., how do I know when there was an update)?

@danmoseley
Member

> memory alignment of data, etc

Do we still see significant impact from this? I guess there are next steps left in dotnet/performance#1602

On that subject, I guess remaining work on code alignment in #43227 will have to be prioritized one way or another for next cycle too.

@AndyAyersMS
Member

If you look at the test history data from the lab, you can select a particular perf jump, left-click on the "before" level and choose "set baseline", then do the same for the "after" level and choose "set compare".

Then you can look at the commit range between the two; you're looking at something like this and within there you'll see an "Update Dependencies" PR -- within that you'll see updates to eng/Version.Details.xml, and in particular this one line that updates the PGO version:

- <Dependency Name="optimization.PGO.CoreCLR" Version="1.0.0-prerelease.21320.4">
+ <Dependency Name="optimization.PGO.CoreCLR" Version="1.0.0-prerelease.21329.4">
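
A small sketch of checking this programmatically, assuming two revisions of eng/Version.Details.xml are available as text. Only the `Dependency` element with its `Name` and `Version` attributes comes from the diff above; the wrapper elements in the sample and the function names are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

PGO_DEPENDENCY = "optimization.PGO.CoreCLR"


def dependency_version(xml_text, name=PGO_DEPENDENCY):
    """Return the Version attribute of the named Dependency, or None."""
    root = ET.fromstring(xml_text)
    for dep in root.iter("Dependency"):
        if dep.get("Name") == name:
            return dep.get("Version")
    return None


def pgo_updated(before_xml, after_xml):
    """True when the PGO dependency version differs between revisions."""
    return dependency_version(before_xml) != dependency_version(after_xml)


# Minimal stand-in for the real file; the wrapper elements are assumed.
TEMPLATE = """<Dependencies>
  <ProductDependencies>
    <Dependency Name="optimization.PGO.CoreCLR" Version="{version}">
    </Dependency>
  </ProductDependencies>
</Dependencies>"""

before = TEMPLATE.format(version="1.0.0-prerelease.21320.4")
after = TEMPLATE.format(version="1.0.0-prerelease.21329.4")
print(dependency_version(before), dependency_version(after), pgo_updated(before, after))
```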

@danmoseley
Member

Ah - that's what I need. I'll check that next time.

@adamsitnik
Member

When looking at the last manual perf run for 6.0 I assumed that these benchmarks are just flaky:

System.Collections.ContainsKeyFalse<Int32, Int32>.SortedList(Size: 512)

| Result | Base | Diff | Ratio | Alloc Delta | Modality | Operating System | Bit | Processor Name | Base V | Diff V |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Same | 15630.24 | 17268.63 | 0.91 | +0 | | Windows 10.0.19043.1165 | X64 | AMD Ryzen Threadripper PRO 3945WX 12-Cores | 5.0.921.35908 | 6.0.21.41701 |
| Slower | 20678.27 | 23181.68 | 0.89 | +0 | | Windows 10.0.20348 | X64 | AMD EPYC 7452 | 5.0.921.35908 | 6.0.21.41701 |
| Slower | 20180.17 | 23033.57 | 0.88 | +0 | | Windows 10.0.20348 | X64 | AMD EPYC 7452 | 5.0.921.35908 | 6.0.21.41701 |
| Same | 28702.59 | 31566.71 | 0.91 | +0 | | Windows 10.0.18363.1621 | X64 | Intel Xeon CPU E5-1650 v4 3.60GHz | 5.0.921.35908 | 6.0.21.41701 |
| Slower | 33899.56 | 40314.91 | 0.84 | +0 | | Windows 8.1 | X64 | Intel Core i7-3610QM CPU 2.30GHz (Ivy Bridge) | 5.0.921.35908 | 6.0.21.45401 |
| Same | 32344.42 | 35359.89 | 0.91 | +0 | | Windows 10.0.19042.685 | X64 | Intel Core i7-5557U CPU 3.10GHz (Broadwell) | 5.0.921.35908 | 6.0.21.41701 |
| Same | 27818.40 | 30054.14 | 0.93 | +0 | | Windows 10.0.19043.1165 | X64 | Intel Core i7-6700 CPU 3.40GHz (Skylake) | 5.0.921.35908 | 6.0.21.41701 |
| Same | 37190.26 | 39400.55 | 0.94 | +0 | | Windows 10.0.22454 | X64 | Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R) | 5.0.921.35908 | 6.0.21.41701 |
| Same | 25109.38 | 26330.44 | 0.95 | +0 | | Windows 10.0.22451 | X64 | Intel Core i7-8700 CPU 3.20GHz (Coffee Lake) | 5.0.921.35908 | 6.0.21.41701 |
| Same | 26131.44 | 28154.31 | 0.93 | +0 | | Windows 10.0.19042.1165 | X64 | Intel Core i9-9900T CPU 2.10GHz | 5.0.921.35908 | 6.0.21.41701 |
| Slower | 45064.25 | 62926.86 | 0.72 | +0 | | Windows 7 SP1 | X64 | Intel Core2 Duo CPU T9600 2.80GHz | 5.0.721.25508 | 6.0.21.41701 |
| Slower | 18594.02 | 23898.72 | 0.78 | +0 | | centos 8 | X64 | AMD EPYC 7452 | 5.0.921.35908 | 6.0.21.41701 |
| Slower | 18902.81 | 23519.43 | 0.80 | +0 | | debian 10 | X64 | AMD EPYC 7452 | 5.0.921.35908 | 6.0.21.41701 |
| Slower | 19385.65 | 24385.02 | 0.79 | +0 | | rhel 7 | X64 | AMD EPYC 7452 | 5.0.921.35908 | 6.0.21.41701 |
| Slower | 18758.27 | 23875.44 | 0.79 | +0 | | sles 15 | X64 | AMD EPYC 7452 | 5.0.921.35908 | 6.0.21.41701 |
| Slower | 19208.14 | 22716.50 | 0.85 | +0 | | opensuse-leap 15.3 | X64 | AMD EPYC 7452 | 5.0.921.35908 | 6.0.21.41701 |
| Slower | 25873.41 | 29614.95 | 0.87 | +0 | | ubuntu 18.04 | X64 | Intel Xeon CPU E5-1650 v4 3.60GHz | 5.0.921.35908 | 6.0.21.41701 |
| Slower | 34763.92 | 38984.71 | 0.89 | +0 | | ubuntu 18.04 | X64 | Intel Core i7-2720QM CPU 2.20GHz (Sandy Bridge) | 5.0.921.35908 | 6.0.21.41701 |
| Same | 31299.91 | 33194.56 | 0.94 | +0 | | alpine 3.13 | X64 | Intel Core i7-7700 CPU 3.60GHz (Kaby Lake) | 5.0.921.35908 | 6.0.21.41701 |
| Same | 43801.13 | 44121.66 | 0.99 | +0 | | ubuntu 16.04 | Arm64 | Unknown processor | 5.0.421.11614 | 6.0.21.41701 |
| Same | 33187.39 | 36221.29 | 0.92 | +0 | | Windows 10.0.19043.1165 | Arm64 | Microsoft SQ1 3.0 GHz | 5.0.921.35908 | 6.0.21.41701 |
| Same | 34680.78 | 36189.27 | 0.96 | +0 | | Windows 10.0.22000 | Arm64 | Microsoft SQ1 3.0 GHz | 5.0.921.35908 | 6.0.21.41701 |
| Slower | 15620.29 | 18899.32 | 0.83 | +0 | | Windows 10.0.19043.1165 | X86 | AMD Ryzen Threadripper PRO 3945WX 12-Cores | 5.0.921.35908 | 6.0.21.41701 |
| Same | 28480.79 | 30515.61 | 0.93 | +0 | bimodal | Windows 10.0.18363.1621 | X86 | Intel Xeon CPU E5-1650 v4 3.60GHz | 5.0.921.35908 | 6.0.21.41701 |
| Slower | 32878.72 | 37487.28 | 0.88 | +0 | | Windows 10.0.19043.1165 | Arm | Microsoft SQ1 3.0 GHz | 5.0.921.35908 | 6.0.21.41701 |
| Slower | 34081.18 | 38946.04 | 0.88 | +0 | | macOS Big Sur 11.5.2 | X64 | Intel Core i5-4278U CPU 2.60GHz (Haswell) | 5.0.921.35908 | 6.0.21.41701 |
| Slower | 28952.99 | 32334.64 | 0.90 | +0 | | macOS Big Sur 11.5.2 | X64 | Intel Core i7-4870HQ CPU 2.50GHz (Haswell) | 5.0.921.35908 | 6.0.21.41701 |
| Slower | 30580.86 | 34945.53 | 0.88 | +0 | | macOS Big Sur 11.4 | X64 | Intel Core i7-5557U CPU 3.10GHz (Broadwell) | 5.0.921.35908 | 6.0.21.41701 |

System.Collections.TryGetValueFalse<Int32, Int32>.SortedList(Size: 512)

| Result | Base | Diff | Ratio | Alloc Delta | Modality | Operating System | Bit | Processor Name | Base V | Diff V |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Slower | 14583.11 | 18196.25 | 0.80 | +0 | | Windows 10.0.19043.1165 | X64 | AMD Ryzen Threadripper PRO 3945WX 12-Cores | 5.0.921.35908 | 6.0.21.41701 |
| Slower | 19698.44 | 24552.94 | 0.80 | +0 | | Windows 10.0.20348 | X64 | AMD EPYC 7452 | 5.0.921.35908 | 6.0.21.41701 |
| Slower | 19549.13 | 24505.74 | 0.80 | +0 | | Windows 10.0.20348 | X64 | AMD EPYC 7452 | 5.0.921.35908 | 6.0.21.41701 |
| Same | 28687.70 | 31988.37 | 0.90 | +0 | bimodal | Windows 10.0.18363.1621 | X64 | Intel Xeon CPU E5-1650 v4 3.60GHz | 5.0.921.35908 | 6.0.21.41701 |
| Slower | 34815.35 | 40048.39 | 0.87 | +0 | | Windows 8.1 | X64 | Intel Core i7-3610QM CPU 2.30GHz (Ivy Bridge) | 5.0.921.35908 | 6.0.21.45401 |
| Same | 32563.28 | 35785.74 | 0.91 | +0 | | Windows 10.0.19042.685 | X64 | Intel Core i7-5557U CPU 3.10GHz (Broadwell) | 5.0.921.35908 | 6.0.21.41701 |
| Same | 27969.65 | 29438.33 | 0.95 | +0 | | Windows 10.0.19043.1165 | X64 | Intel Core i7-6700 CPU 3.40GHz (Skylake) | 5.0.921.35908 | 6.0.21.41701 |
| Same | 37241.70 | 39451.36 | 0.94 | +0 | | Windows 10.0.22454 | X64 | Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R) | 5.0.921.35908 | 6.0.21.41701 |
| Same | 24716.30 | 26593.07 | 0.93 | +0 | | Windows 10.0.22451 | X64 | Intel Core i7-8700 CPU 3.20GHz (Coffee Lake) | 5.0.921.35908 | 6.0.21.41701 |
| Same | 27215.48 | 27920.34 | 0.97 | +0 | | Windows 10.0.19042.1165 | X64 | Intel Core i9-9900T CPU 2.10GHz | 5.0.921.35908 | 6.0.21.41701 |
| Slower | 46122.94 | 63292.95 | 0.73 | +0 | | Windows 7 SP1 | X64 | Intel Core2 Duo CPU T9600 2.80GHz | 5.0.721.25508 | 6.0.21.41701 |
| Slower | 18494.56 | 22117.07 | 0.84 | +0 | | centos 8 | X64 | AMD EPYC 7452 | 5.0.921.35908 | 6.0.21.41701 |
| Slower | 19264.81 | 25169.32 | 0.77 | +0 | | debian 10 | X64 | AMD EPYC 7452 | 5.0.921.35908 | 6.0.21.41701 |
| Slower | 18693.45 | 22418.03 | 0.83 | +0 | | rhel 7 | X64 | AMD EPYC 7452 | 5.0.921.35908 | 6.0.21.41701 |
| Slower | 20366.13 | 25532.80 | 0.80 | +0 | | sles 15 | X64 | AMD EPYC 7452 | 5.0.921.35908 | 6.0.21.41701 |
| Slower | 19613.86 | 25289.33 | 0.78 | +0 | | opensuse-leap 15.3 | X64 | AMD EPYC 7452 | 5.0.921.35908 | 6.0.21.41701 |
| Slower | 26467.75 | 30251.47 | 0.87 | +0 | | ubuntu 18.04 | X64 | Intel Xeon CPU E5-1650 v4 3.60GHz | 5.0.921.35908 | 6.0.21.41701 |
| Slower | 34747.86 | 38926.07 | 0.89 | +0 | | ubuntu 18.04 | X64 | Intel Core i7-2720QM CPU 2.20GHz (Sandy Bridge) | 5.0.921.35908 | 6.0.21.41701 |
| Slower | 29441.72 | 33311.51 | 0.88 | +0 | | alpine 3.13 | X64 | Intel Core i7-7700 CPU 3.60GHz (Kaby Lake) | 5.0.921.35908 | 6.0.21.41701 |
| Same | 43243.67 | 43650.36 | 0.99 | +0 | | ubuntu 16.04 | Arm64 | Unknown processor | 5.0.421.11614 | 6.0.21.41701 |
| Same | 33389.19 | 34600.58 | 0.96 | +0 | | Windows 10.0.19043.1165 | Arm64 | Microsoft SQ1 3.0 GHz | 5.0.921.35908 | 6.0.21.41701 |
| Same | 35076.73 | 34480.20 | 1.02 | +0 | | Windows 10.0.22000 | Arm64 | Microsoft SQ1 3.0 GHz | 5.0.921.35908 | 6.0.21.41701 |
| Slower | 15611.10 | 19623.90 | 0.80 | +0 | | Windows 10.0.19043.1165 | X86 | AMD Ryzen Threadripper PRO 3945WX 12-Cores | 5.0.921.35908 | 6.0.21.41701 |
| Same | 27990.94 | 30754.16 | 0.91 | +0 | | Windows 10.0.18363.1621 | X86 | Intel Xeon CPU E5-1650 v4 3.60GHz | 5.0.921.35908 | 6.0.21.41701 |
| Same | 35018.86 | 36948.07 | 0.95 | +0 | | Windows 10.0.19043.1165 | Arm | Microsoft SQ1 3.0 GHz | 5.0.921.35908 | 6.0.21.41701 |
| Slower | 33809.69 | 42191.42 | 0.80 | +0 | | macOS Big Sur 11.5.2 | X64 | Intel Core i5-4278U CPU 2.60GHz (Haswell) | 5.0.921.35908 | 6.0.21.41701 |
| Slower | 28922.73 | 33947.31 | 0.85 | +0 | | macOS Big Sur 11.5.2 | X64 | Intel Core i7-4870HQ CPU 2.50GHz (Haswell) | 5.0.921.35908 | 6.0.21.41701 |
| Slower | 30420.86 | 35304.35 | 0.86 | +0 | | macOS Big Sur 11.4 | X64 | Intel Core i7-5557U CPU 3.10GHz (Broadwell) | 5.0.921.35908 | 6.0.21.41701 |
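
For readers checking the numbers: the Ratio column in the table above appears to be Base divided by Diff, so values below 1.00 mean the 6.0 build took longer. The sketch below recomputes it for three rows copied from the table. The function name `ratio` is hypothetical, and the rule that labels a row Slower or Same is not reproduced here, since the thread does not state it (rows at 0.88 and 0.90 get different labels, so it is presumably not a plain ratio cutoff).

```python
# Minimal sketch: recompute the Ratio column as Base / Diff.
# Assumption: Ratio = Base / Diff, inferred from the table values, not from the tool's source.

def ratio(base: float, diff: float) -> float:
    """Return base time divided by diff time (below 1.0 means diff is slower)."""
    return base / diff

# (label, Base, Diff) taken verbatim from three rows of the SortedList table.
rows = [
    ("Windows 10.0.19043.1165 X64 AMD Ryzen Threadripper PRO 3945WX", 14583.11, 18196.25),
    ("Windows 7 SP1 X64 Intel Core2 Duo T9600", 46122.94, 63292.95),
    ("Windows 10.0.22000 Arm64 Microsoft SQ1", 35076.73, 34480.20),
]

for label, base, diff in rows:
    print(f"{label}: {ratio(base, diff):.2f}")
```

The three printed ratios round to 0.80, 0.73, and 1.02, matching the corresponding table rows.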

But looking at the historical data, it seems that we have slightly regressed SortedList:

image

https://pvscmdupload.blob.core.windows.net/reports/allTestHistory%2frefs%2fheads%2fmain_x64_Windows%2010.0.18362%2fSystem.Collections.TryGetValueFalse(Int32%2c%20Int32).SortedList(Size%3a%20512).html

image

@AndyAyersMS

Stale issue.

@github-actions github-actions bot locked and limited conversation to collaborators Jul 20, 2024