Binary search performance regressed in 1.25 #53823

Emerentius · 2018-08-30T14:30:23Z

The speedup obtained by the optimizations from #45333 which hit stable in 1.23 were mostly lost again from 1.25 onwards, because the compiler started emitting a branch instead of a cmov.

1.24

14  cmp     qword ptr [rsi + 8*rax], rcx
15  cmova   rax, r8

1.25

18  cmp     qword ptr [rsi + 8*r9], rcx
19  ja      .LBB0_7
20  mov     r8, r9
21  .LBB0_7:

From godbolt: https://godbolt.org/z/DUDl-T

Bench comparison with 1.23 nightly (2017-11-20, 9 days after PR merge)

 name                        old_1.23 ns/iter  new_1.23 ns/iter  diff ns/iter   diff %  speedup 
 binary_search_l1            70                28                         -42  -60.00%   x 2.50 
 binary_search_l1_with_dups  46                28                         -18  -39.13%   x 1.64 
 binary_search_l2            94                42                         -52  -55.32%   x 2.24 
 binary_search_l2_with_dups  68                42                         -26  -38.24%   x 1.62 
 binary_search_l3            267               234                        -33  -12.36%   x 1.14 
 binary_search_l3_with_dups  198               236                         38   19.19%   x 0.84

and status quo with 1.30 nightly (2018-08-15)

 name                        old_1.30 ns/iter  new_1.30 ns/iter  diff ns/iter   diff %  speedup 
 binary_search_l1            75                68                          -7   -9.33%   x 1.10 
 binary_search_l1_with_dups  46                61                          15   32.61%   x 0.75 
 binary_search_l2            99                89                         -10  -10.10%   x 1.11 
 binary_search_l2_with_dups  71                89                          18   25.35%   x 0.80 
 binary_search_l3            272               245                        -27   -9.93%   x 1.11 
 binary_search_l3_with_dups  212               256                         44   20.75%   x 0.83

The text was updated successfully, but these errors were encountered:

Emerentius · 2018-08-31T17:17:54Z

The likely culprit is the upgrade from LLVM 4 to 6. The upgrade was merged on the 10th of Feb. The binary search was fast on the 9th and slow on the 12th.

@estebank The optimization did hit stable (if only for 2 releases), so does it not need the regression-from-stable-to-stable label?

estebank · 2018-08-31T17:40:47Z

@Emerentius what do you mean for only 2 releases? Isn't the slowdown present in stable through nightly? Reading the assembly from godbolt it seems to me like there have been some changes but the code causing the slowdown presented in this issue is still there. If that is the case, then it would need that label, but if it was fixed in current stable, then we can close the ticket, as I do not believe we cut point releases for previous stable releases.

Emerentius · 2018-08-31T19:38:54Z

@estebank I mean it was fast in 1.23 stable and slow from 1.25 stable on, so fast for 2 stable releases, then regressed.

pnkfelix · 2018-11-29T14:24:21Z

triage: P-medium, since I don't think investigating this takes priority over other things currently being juggled. but also self-assigning since I'm curious and I think I can do some bisection as a background task, assuming I can reproduce the perf regression locally.

pnkfelix · 2018-11-29T15:56:06Z

discussed in T-compiler meeting. Re-prioritizing as P-high just for evaluating the scope of the regression (though some good work on that is already evident in this comment thread).

pnkfelix · 2018-12-12T15:54:55Z

Not sure why I previously removed I-slow; I'm putting it back now.

pnkfelix · 2018-12-12T16:05:22Z

Hey @Emerentius, what is the distinction you are making in the columns with the prefix "old_" and "new_" in the tables you present?

That is, I would have thought that "old_" somehow corresponds to the state prior to the changes from PR #45333 and "new_" corresponds to after PR #45333 landed; but you already established that PR #45333 landed in between the two versions that each have "old_" and "new_" labels applied to them.

So I don't understand what difference has been applied to either the code or the benchmark in between the two columns there. (Hypothesis: Is "old_" testing the binary search implementation prior to PR #45333, and "new_" testing the binary search implementation that was added by PR #45333 ?)

pnkfelix · 2018-12-12T16:08:16Z

(In any case I definitely am able to reproduce a regression on my Linux desktop on benches/slice between nightly-2018-02-09 and 2018-02-12, as anticipated/noted by #53823 (comment) ; there's a fair amount of noise between runs when looking at binary_search_l3 in my own machine, but binary_search_l1 is much more consistent in illustrating a 2x regression for me.)

Emerentius · 2018-12-12T17:22:08Z

@pnkfelix

Hypothesis: Is "old_" testing the binary search implementation prior to PR #45333, and "new_" testing the binary search implementation that was added by PR #45333 ?

That is exactly it

pnkfelix · 2018-12-13T13:13:32Z

Okay so I've done some preliminary analysis of the scope of the problem.

On my desktop linux machine, I built the benchmark suite in libcore/benches atop the nightly compilers from 2018-02-09 and 2018-02-12. Then I did three runs of the suite on each compiler (for a total of six runs).

Update: On the LLVM ticket associated with this issue that was subsequently filed, someone noted that I didn't indicate the processor model where I gathered this data. Here that is, in a details block.

% cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 60
model name      : Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
stepping        : 3
microcode       : 0x25
cpu MHz         : 1728.267
cache size      : 8192 KB
physical id     : 0
siblings        : 8
core id         : 0
cpu cores       : 4
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_goo\
d nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdra\
nd lahf_lm abm cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt dtherm ida arat pln pts flush_l\
1d
bugs            : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf
bogomips        : 7183.27
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
power management:

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 6
model           : 60
model name      : Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
stepping        : 3
microcode       : 0x25
cpu MHz         : 1545.735
cache size      : 8192 KB
physical id     : 0
siblings        : 8
core id         : 1
cpu cores       : 4
apicid          : 2
initial apicid  : 2
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_goo\
d nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdra\
nd lahf_lm abm cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt dtherm ida arat pln pts flush_l\
1d
bugs            : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf
bogomips        : 7183.27
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
power management:

processor       : 2
vendor_id       : GenuineIntel
cpu family      : 6
model           : 60
model name      : Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
stepping        : 3
microcode       : 0x25
cpu MHz         : 1588.301
cache size      : 8192 KB
physical id     : 0
siblings        : 8
core id         : 2
cpu cores       : 4
apicid          : 4
initial apicid  : 4
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_goo\
d nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdra\
nd lahf_lm abm cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt dtherm ida arat pln pts flush_l\
1d
bugs            : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf
bogomips        : 7183.27
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
power management:

processor       : 3
vendor_id       : GenuineIntel
cpu family      : 6
model           : 60
model name      : Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
stepping        : 3
microcode       : 0x25
cpu MHz         : 2202.815
cache size      : 8192 KB
physical id     : 0
siblings        : 8
core id         : 3
cpu cores       : 4
apicid          : 6
initial apicid  : 6
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_goo\
d nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdra\
nd lahf_lm abm cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt dtherm ida arat pln pts flush_l\
1d
bugs            : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf
bogomips        : 7183.27
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
power management:

processor       : 4
vendor_id       : GenuineIntel
cpu family      : 6
model           : 60
model name      : Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
stepping        : 3
microcode       : 0x25
cpu MHz         : 1990.709
cache size      : 8192 KB
physical id     : 0
siblings        : 8
core id         : 0
cpu cores       : 4
apicid          : 1
initial apicid  : 1
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_goo\
d nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdra\
nd lahf_lm abm cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt dtherm ida arat pln pts flush_l\
1d
bugs            : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf
bogomips        : 7183.27
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
power management:

processor       : 5
vendor_id       : GenuineIntel
cpu family      : 6
model           : 60
model name      : Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
stepping        : 3
microcode       : 0x25
cpu MHz         : 2112.084
cache size      : 8192 KB
physical id     : 0
siblings        : 8
core id         : 1
cpu cores       : 4
apicid          : 3
initial apicid  : 3
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_goo\
d nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdra\
nd lahf_lm abm cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt dtherm ida arat pln pts flush_l\
1d
bugs            : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf
bogomips        : 7183.27
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
power management:

processor       : 6
vendor_id       : GenuineIntel
cpu family      : 6
model           : 60
model name      : Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
stepping        : 3
microcode       : 0x25
cpu MHz         : 1789.977
cache size      : 8192 KB
physical id     : 0
siblings        : 8
core id         : 2
cpu cores       : 4
apicid          : 5
initial apicid  : 5
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_goo\
d nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdra\
nd lahf_lm abm cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt dtherm ida arat pln pts flush_l\
1d
bugs            : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf
bogomips        : 7183.27
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
power management:

processor       : 7
vendor_id       : GenuineIntel
cpu family      : 6
model           : 60
model name      : Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
stepping        : 3
microcode       : 0x25
cpu MHz         : 2148.265
cache size      : 8192 KB
physical id     : 0
siblings        : 8
core id         : 3
cpu cores       : 4
apicid          : 7
initial apicid  : 7
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_goo\
d nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdra\
nd lahf_lm abm cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt dtherm ida arat pln pts flush_l\
1d
bugs            : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf
bogomips        : 7183.27
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
power management:

Here is the raw data from that exercise: https://gist.github.com/8f511af7012c750928a6bdbf8efa3e6f

Then I put it into a spreadsheet, took the min (of the three runs) of the reported ns/iter for each benchmark, and asked to see the ratio (i.e. slowdown, where 2.0 is a 2x slowdown, 1.0 is no change, and 0.5 is a 2x speedup), comparing the min for 2018-02-09 to 2018-02-12.

(One might argue I should use average instead of minimum... but for understanding compiler output, I go along with the argument presented here that usually the outliers are much slower than normal, rather than faster, and so taking the minimum is more likely to filter out transient noise rather than artificial improvement)

Finally, when looking at the slowdown ratio, I asked for the values > 1.2 to be highlighted in red (so that we'd pay special attention to a 1.2x slowdown) and the values < 0.9 to be highlighted in green (so that we'd also notice reasonable speedups).

Here is that spreadsheet: https://www.dropbox.com/s/qznzchqhibi5o67/three-runs.xlsx?dl=0

pnkfelix · 2018-12-13T13:32:52Z

A number of benchmarks improved significantly (i.e. the slowdown ratio is <0.9) over the time period specified. I'm not going to list them here, but there are ~44 of them.

But here are the 9 cases that regressed:

test	2018-02-09 ns/iter (min of 3)	2018-02-12 ns/iter (min of 3)	slowdown
char::methods::bench_to_digit_radix_10	16,208	20,078	1.24x
iter::bench_zip_then_skip	2,580	3,899	1.51x
num::bench_u32_from_str_radix_2	17,896	23,223	1.30x
slice::binary_search_l1	21	51	2.43x
slice::binary_search_l1_with_dups	21	46	2.19x
slice::binary_search_l2	31	67	2.16x
slice::binary_search_l2_with_dups	31	67	2.16x
slice::binary_search_l3	108	141	1.31x
slice::binary_search_l3_with_dups	109	142	1.30x

Obviously the latter six cases are the ones that were already being pointed out by this bug.

(The other three entries in the table may or may not have the same core cause.)

Obviously the libcore/benches benchmark should not be the sole decider about whether a regression is significant, but based on these results, the impact does seem to be somewhat restricted in scope...

pnkfelix · 2018-12-14T12:59:50Z

Also, there are some pretty interesting discussions among LLVM developers about when they turn cmov into branches. It sounds there are a number of scenarios where this exhibits better performance due to (I think) either 1. the processor's branch prediction or 2. an expensive test condition that you'd rather evaluate just once

pnkfelix · 2018-12-14T13:12:48Z

In particular, in the LLVM developer discussions linked above, I saw mention of the following flag: -x86-cmov-converter=false

I gave that a whirl, and the results seem promising:

% rustc +nightly-2018-02-12 -O -C llvm-args=-x86-cmov-converter=false --test lib.rs -o ../benches-2018-02-12.nocmovconv
% rustc +nightly-2018-02-12 -O                                        --test lib.rs -o ../benches-2018-02-12
% rustc +nightly-2018-02-09 -O                                        --test lib.rs -o ../benches-2018-02-09
% ../benches-2018-02-09 --bench slice

running 6 tests
test slice::binary_search_l1                               ... bench:          21 ns/iter (+/- 0)
test slice::binary_search_l1_with_dups                     ... bench:          21 ns/iter (+/- 0)
test slice::binary_search_l2                               ... bench:          31 ns/iter (+/- 0)
test slice::binary_search_l2_with_dups                     ... bench:          31 ns/iter (+/- 0)
test slice::binary_search_l3                               ... bench:         110 ns/iter (+/- 1)
test slice::binary_search_l3_with_dups                     ... bench:         112 ns/iter (+/- 7)

test result: ok. 0 passed; 0 failed; 0 ignored; 6 measured; 142 filtered out

% ../benches-2018-02-12 --bench slice

running 6 tests
test slice::binary_search_l1                               ... bench:          51 ns/iter (+/- 0)
test slice::binary_search_l1_with_dups                     ... bench:          46 ns/iter (+/- 0)
test slice::binary_search_l2                               ... bench:          67 ns/iter (+/- 1)
test slice::binary_search_l2_with_dups                     ... bench:          67 ns/iter (+/- 0)
test slice::binary_search_l3                               ... bench:         143 ns/iter (+/- 1)
test slice::binary_search_l3_with_dups                     ... bench:         142 ns/iter (+/- 0)

test result: ok. 0 passed; 0 failed; 0 ignored; 6 measured; 142 filtered out

% ../benches-2018-02-12.nocmovconv --bench slice

running 6 tests
test slice::binary_search_l1                               ... bench:          21 ns/iter (+/- 0)
test slice::binary_search_l1_with_dups                     ... bench:          21 ns/iter (+/- 0)
test slice::binary_search_l2                               ... bench:          31 ns/iter (+/- 1)
test slice::binary_search_l2_with_dups                     ... bench:          31 ns/iter (+/- 0)
test slice::binary_search_l3                               ... bench:         114 ns/iter (+/- 2)
test slice::binary_search_l3_with_dups                     ... bench:         115 ns/iter (+/- 1)

test result: ok. 0 passed; 0 failed; 0 ignored; 6 measured; 142 filtered out

pnkfelix · 2018-12-14T13:17:15Z

To be clear: I am not suggesting that we should turn the x86-cmov-converter pass off by default in rustc.

But:

we should check whether the improved version from https://reviews.llvm.org/D36081 is in our LLVM (and if not, consider cherry-picking it over), and
if we can reproduce this issue on the current LLVM version, then we should file an issue with them upstream.

nikic · 2018-12-14T13:27:36Z

Just checked, that patch is in our LLVM, and this still reproduces with current LLVM (running generated IR through llc).

nikic · 2018-12-14T13:53:51Z

~~I think the issue here might be that cmov conversion does not realize that it will not be possible to sink the cmovd chain into the created branches, because the condition also depends on them.~~ I misunderstood what this pass is intended to do. It's not interested in hoisting anything into the branches, the only point is to remove the dependency on the branch condition through speculation.

nikic · 2018-12-14T17:51:12Z

I've filed an LLVM issue: https://bugs.llvm.org/show_bug.cgi?id=40027

pnkfelix · 2018-12-17T11:48:35Z

Okay at this point I think I've convinced myself that the scope of this problem is somewhat limited,

So I'm going to downgrade this to P-medium.

It also isn't really going to be addressed on our end (except perhaps via cherry-picking a patch for the LLVM issue).

So I'm tempted to close it, but I'll maybe wait to let other members of @rust-lang/compiler weigh in before I do that.

pnkfelix · 2018-12-17T11:50:39Z

I'll leave myself assigned for now, but mostly as a reminder to close it in the future.

If someone else wants to take this issue and run with it, feel free to assign yourself and unassign me.

nikic · 2018-12-17T13:32:36Z

I think the main actionable point here is to try and change the binary search implementation in a way that does not run into this LLVM issue. Not sure how that would look like, but something that interchanges the order of the compare and the load (i.e. already load for the next iteration) might work, because this is a pattern that LLVM specifically checks for.

the8472 · 2021-07-04T11:32:08Z

Putting this on T-libs if they can find a libs-level solution.

I have tested several older revisions of the binary_search and partition_point implementations. All of them had their conditional moves converted to cmp + branch due to this llvm change.

This LLVM issue comment suggests that the primary heuristic which turns off this optimization only looks for binary tree searches (unpredictable pointer chasing) but not binary searches (unpredictable pointer arithmetic).

Another heuristic related to cost modelling has been added. Modifying its threshold via -C llvm-args=-x86-cmov-converter-threshold=10 does revert the old binary search implementation (see previous comment) to conditional moves. This doesn't work for the current, short-circuiting implementation from #74024 (but running that one on rust 1.24 does produce cmovs).
I don't think there's anything we could do on the libs side to trigger that heuristic with its current default (4).

The options I currently see are:

wait for an llvm fix (might require a new unpredictable intrinsic)
special-case x86 with assembly
tweak llvm-args

Only option 2 would be a pure libs thing. Some of those options would still require changes to the current implementation and benchmark runs. The old while size > 1 approach seems to be easier to turn into cmovs than current while left < right one. Or separate implementations partition_point and binary_search may be needed.

yaahc · 2021-07-19T15:52:33Z

We discussed this issue in our last libs team meeting and concluded that there's nothing left for libs to do, and that we need to wait for the fix to happen upstream in https://bugs.llvm.org/show_bug.cgi?id=40027. Our plan is to hand this over to the compiler team for tracking as discussed in https://rust-lang.zulipchat.com/#narrow/stream/131828-t-compiler/topic/tracking.20an.20old.20regression. For now we've tagged both teams and nominated it so that the compiler team can confirm the handover in their next meeting.

pnkfelix · 2021-07-22T14:00:27Z

@yaahc @joshtriplett I'm happy to confirm handover to T-compiler, but the prioritization of this (currently P-low; P-medium as of 8 days ago) essentially means that we're never going to prioritize it for discussion in a meeting. It will just be task hanging out there for interested people to look at and discuss.

So, I just wanted to confirm: Is that an acceptable outcome from the point of view of you all? Or do you think this should be assigned higher priority?

joshtriplett · 2021-07-22T18:28:49Z

@pnkfelix That's roughly the outcome we were expecting.

We do feel that this would be worthwhile to have on the radar of a team whose primary purview is LLVM-related fixes that would benefit Rust, but we also recognize that that's a "somebody should", and we don't have a somebody to offer.

yaahc · 2021-07-22T23:44:11Z

What Josh said, I've gone ahead and removed the T-libs label, thanks for the confirmation @pnkfelix ^_^.

vlovich · 2023-09-01T00:12:02Z

Could something like https://docs.rs/cmov/latest/cmov/ be used in the interim to force the usage of cmov to recover the performance?

mohe2015 · 2023-09-01T00:19:56Z

Is profile guided optimization able to do anything in this case maybe?

nikic · 2023-09-14T07:52:42Z

The x86 backend should now respect __builtin_unpredicate for cmov conversion, so if we can expose that in rustc in some way, we can use that.

the8472 · 2023-09-14T10:33:00Z

There's a discussion what to do about likely()/unlikely() on IRLO. Adding an unpredictable() was mentioned too.

This restores the original binary search implementation from rust-lang#45333 which has the nice property of having a loop count that only depends on the size of the slice. This, along with explicit conditional moves from rust-lang#128250, means that the entire binary search loop can be perfectly predicted by the branch predictor. Additionally, LLVM is able to unroll the loop when the slice length is known at compile-time. This results in a very compact code sequence of 3-4 instructions per binary search step and zero branches. Fixes rust-lang#53823

Amanieu · 2024-07-26T23:40:47Z

See #128254 for a potential fix.

This restores the original binary search implementation from rust-lang#45333 which has the nice property of having a loop count that only depends on the size of the slice. This, along with explicit conditional moves from rust-lang#128250, means that the entire binary search loop can be perfectly predicted by the branch predictor. Additionally, LLVM is able to unroll the loop when the slice length is known at compile-time. This results in a very compact code sequence of 3-4 instructions per binary search step and zero branches. Fixes rust-lang#53823

Rewrite binary search implementation This PR builds on top of rust-lang#128250, which should be merged first. This restores the original binary search implementation from rust-lang#45333 which has the nice property of having a loop count that only depends on the size of the slice. This, along with explicit conditional moves from rust-lang#128250, means that the entire binary search loop can be perfectly predicted by the branch predictor. Additionally, LLVM is able to unroll the loop when the slice length is known at compile-time. This results in a very compact code sequence of 3-4 instructions per binary search step and zero branches. Fixes rust-lang#53823 Fixes rust-lang#115271

estebank added the I-slow Issue: Problems and improvements with respect to performance of generated code. label Aug 30, 2018

estebank added the regression-from-stable-to-stable Performance or correctness regression from one stable version to another. label Aug 31, 2018

pnkfelix self-assigned this Nov 29, 2018

pnkfelix added P-high High priority and removed I-slow Issue: Problems and improvements with respect to performance of generated code. labels Nov 29, 2018

pnkfelix added the I-slow Issue: Problems and improvements with respect to performance of generated code. label Dec 12, 2018

nikic added the A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. label Dec 15, 2018

pnkfelix added P-medium Medium priority and removed P-high High priority labels Dec 17, 2018

arielb1 added the T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. label Dec 17, 2018

joshtriplett added P-low Low priority T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. and removed P-medium Medium priority labels Jul 14, 2021

yaahc added the I-nominated label Jul 19, 2021

apiraino removed the I-nominated label Jul 22, 2021

yaahc removed the T-libs Relevant to the library team, which will review and decide on the PR/issue. label Jul 22, 2021

the8472 mentioned this issue Nov 11, 2021

Implement likely/unlikely intrinsic (tracking issue for RFC 1131) #26179

Open

4 tasks

GoldsteinE mentioned this issue May 17, 2023

Branch in string access ParkMyCar/compact_str#268

Closed

ryao mentioned this issue May 19, 2023

btree: Reduce opportunities for branch mispredictions in binary search openzfs/zfs#14866

Merged

13 tasks

nikic mentioned this issue Jun 2, 2023

[X86] cmov conversion hurts binary search performance llvm/llvm-project#39374

Open

Amanieu mentioned this issue Jul 26, 2024

Rewrite binary search implementation #128254

Merged

bors closed this as completed in bb58488 Aug 2, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Binary search performance regressed in 1.25 #53823

Binary search performance regressed in 1.25 #53823

Emerentius commented Aug 30, 2018 •

edited

Loading

Emerentius commented Aug 31, 2018 •

edited

Loading

estebank commented Aug 31, 2018

Emerentius commented Aug 31, 2018

pnkfelix commented Nov 29, 2018

pnkfelix commented Nov 29, 2018

pnkfelix commented Dec 12, 2018

pnkfelix commented Dec 12, 2018

pnkfelix commented Dec 12, 2018 •

edited

Loading

Emerentius commented Dec 12, 2018

pnkfelix commented Dec 13, 2018 •

edited

Loading

pnkfelix commented Dec 13, 2018 •

edited

Loading

pnkfelix commented Dec 14, 2018 •

edited

Loading

pnkfelix commented Dec 14, 2018

pnkfelix commented Dec 14, 2018 •

edited

Loading

nikic commented Dec 14, 2018

nikic commented Dec 14, 2018 •

edited

Loading

nikic commented Dec 14, 2018

pnkfelix commented Dec 17, 2018

pnkfelix commented Dec 17, 2018

nikic commented Dec 17, 2018

the8472 commented Jul 4, 2021

yaahc commented Jul 19, 2021

pnkfelix commented Jul 22, 2021

joshtriplett commented Jul 22, 2021

yaahc commented Jul 22, 2021

vlovich commented Sep 1, 2023

mohe2015 commented Sep 1, 2023

nikic commented Sep 14, 2023

the8472 commented Sep 14, 2023

Amanieu commented Jul 26, 2024

Binary search performance regressed in 1.25 #53823

Binary search performance regressed in 1.25 #53823

Comments

Emerentius commented Aug 30, 2018 • edited Loading

Emerentius commented Aug 31, 2018 • edited Loading

estebank commented Aug 31, 2018

Emerentius commented Aug 31, 2018

pnkfelix commented Nov 29, 2018

pnkfelix commented Nov 29, 2018

pnkfelix commented Dec 12, 2018

pnkfelix commented Dec 12, 2018

pnkfelix commented Dec 12, 2018 • edited Loading

Emerentius commented Dec 12, 2018

pnkfelix commented Dec 13, 2018 • edited Loading

pnkfelix commented Dec 13, 2018 • edited Loading

pnkfelix commented Dec 14, 2018 • edited Loading

pnkfelix commented Dec 14, 2018

pnkfelix commented Dec 14, 2018 • edited Loading

nikic commented Dec 14, 2018

nikic commented Dec 14, 2018 • edited Loading

nikic commented Dec 14, 2018

pnkfelix commented Dec 17, 2018

pnkfelix commented Dec 17, 2018

nikic commented Dec 17, 2018

the8472 commented Jul 4, 2021

yaahc commented Jul 19, 2021

pnkfelix commented Jul 22, 2021

joshtriplett commented Jul 22, 2021

yaahc commented Jul 22, 2021

vlovich commented Sep 1, 2023

mohe2015 commented Sep 1, 2023

nikic commented Sep 14, 2023

the8472 commented Sep 14, 2023

Amanieu commented Jul 26, 2024

Emerentius commented Aug 30, 2018 •

edited

Loading

Emerentius commented Aug 31, 2018 •

edited

Loading

pnkfelix commented Dec 12, 2018 •

edited

Loading

pnkfelix commented Dec 13, 2018 •

edited

Loading

pnkfelix commented Dec 13, 2018 •

edited

Loading

pnkfelix commented Dec 14, 2018 •

edited

Loading

pnkfelix commented Dec 14, 2018 •

edited

Loading

nikic commented Dec 14, 2018 •

edited

Loading