[SOLVED] Compiled binaries executing significantly slower #36705

mmstick · 2016-09-25T01:00:05Z

As I've developed this Parallel application, I've noticed that everyone who uses the application with an AMD processor always reports significantly slower runtime performance than those using Intel processors, as in the case of my personal benchmark listed within. The performance difference is so drastic that my AMD FX-8120 is magnitudes slower than my mobile Intel i5, to the point that the original Perl implementation is much faster than the Rust implementation on AMD hardware, and yet on Intel hardware it is 40x slower than the Rust implementation.

Shnatsel · 2016-09-25T01:04:53Z

I can confirm this on rustc 1.11.0 (9b21dcd6a 2016-08-15)

Sysprof recording showing where the time is spent is attached at mmstick/parallel#22

Shnatsel · 2016-09-25T01:09:02Z

Also can confirm this on rustc 1.13.0-nightly (4f9812a59 2016-09-21)

mmstick · 2016-09-25T03:58:53Z

sysprofs-intel.zip Another sysprof recording but from my Intel laptop (ignore chrome streaming in the background). There's certainly something wonky happening with AMD hardware, looking at the provided sysprof from @Shnatsel.

gnzlbg · 2016-09-25T11:58:06Z

I would need to know:

the results of perf in both systems, are the same functions hot in both systems?
perf gives you which part of the assembly in the hottest functions is "hot", gives us this info as well,
How are you compiling your application? Are you using target=native ? (It might well be that the assembly generated by LLVM's generic target benefits intel in your particular case).

mmstick · 2016-09-25T15:41:47Z

Here's the result of perf record -F 99 with both the debug and release build on my Intel laptop. Command executed was seq 1 10000 | parallel echo > /dev/null.

mmstick · 2016-09-25T16:04:37Z

Here's results from my AMD FX-8120 desktop:

Compiling with -C target_cpu=native makes no difference to performance.

mmstick · 2016-09-25T16:06:39Z

Compilation is simply being done by cargo build --release

mmstick · 2016-09-25T16:18:32Z

And these are measurements taken with using just four threads instead of eight for the FX-8120.

cuviper · 2016-09-26T18:03:20Z

You should also use perf archive to collect the objects used in your run, because perf.data alone doesn't contain any symbol information.

Anyway, even without symbols I can see that your AMD machine is mostly busy within just one particular function in the kernel:

Overhead  Command   Shared Object       Symbol                
  82.29%  parallel  [kernel.kallsyms]   [k] 0xffffffff812f8c37
   1.23%  parallel  [kernel.kallsyms]   [k] 0xffffffff812f8c3a
   0.58%  parallel  libpthread-2.24.so  [.] 0x0000000000009c0b
   0.41%  parallel  [kernel.kallsyms]   [k] 0xffffffff810fdf43
   0.41%  parallel  [kernel.kallsyms]   [k] 0xffffffff812f8adc
   0.33%  parallel  [kernel.kallsyms]   [k] 0xffffffff811e48bb
   0.29%  sh        [kernel.kallsyms]   [k] 0xffffffff810c3cba
   0.27%  parallel  [kernel.kallsyms]   [k] 0xffffffff8117a771
   0.26%  parallel  [kernel.kallsyms]   [k] 0xffffffff81301729
   0.26%  parallel  [kernel.kallsyms]   [k] 0xffffffff815d7337

Whereas the Intel machine is much more spread out:

Overhead  Command   Shared Object       Symbol                
  10.70%  parallel  [kernel.vmlinux]    [k] 0xffffffffaa0378e4
  10.24%  parallel  libpthread-2.22.so  [.] 0x0000000000009a47
   9.28%  sh        ld-2.22.so          [.] 0x0000000000017273
   5.82%  sh        [kernel.vmlinux]    [k] 0xffffffffaa49df6a
   5.24%  parallel  ld-2.22.so          [.] 0x0000000000015d68
   4.99%  parallel  [kernel.vmlinux]    [k] 0xffffffffaa105044
   4.93%  parallel  [kernel.vmlinux]    [k] 0xffffffffaa0b9057
   4.85%  parallel  [kernel.vmlinux]    [k] 0xffffffffaa0d760f
   4.68%  sh        libc-2.22.so        [.] 0x000000000007c55a
   4.59%  sh        [kernel.vmlinux]    [k] 0xffffffffaa0ae454

But even the full Intel report doesn't any samples in the parallel object itself, so it doesn't seem your benchmark is even stressing Rust generated code. With so much in the kernel and ld.so, I think you're mostly just hammering on process creation. If I understand right, you're launching 10,000 distinct echo processes, so that probably dwarfs the actual parallel work.

(And note that if you're doing this through sh -c, then it will actually get the shell's echo builtin, which would explain why the report only shows sh. Try with /bin/echo if you want that to be a truly distinct process too.)

Perhaps that's hitting a pathological case in the AMD case with something like page table setup, for instance. But we shouldn't just guess -- run perf report yourself to see what that 82% symbol is.

mmstick · 2016-09-26T18:21:03Z

Does that mean that it's an issue with the Linux kernel being poorly optimized for AMD processors with that specific function?

cuviper · 2016-09-26T18:27:43Z

Possibly, but let the data guide you. See what that symbol is first. Then you can also hit a in perf report to annotate that function with the actual instructions that recorded samples.

Have you tried the same benchmark with GNU parallel? I'll bet it will face the same problem, and if not you can try to figure out what it's doing differently.

mmstick · 2016-09-26T18:43:32Z

Using the same benchmark, GNU Parallel takes 30s compared to the 36s spent with the Rust implementation on my AMD FX-8120 using four cores. With my i5 2410M laptop, GNU Parallel takes 54 seconds to do what the Rust implementation does in 5 seconds.

mmstick · 2016-09-26T18:52:00Z

Not exactly sure what I'm looking at, but I have a new perf data with the archive data alongside it.
amd_perf_data.zip

cuviper · 2016-09-26T19:16:00Z

Hmm, that new profile doesn't have the same 82% cluster. Is it still slow compared to Intel?

Overhead  Command   Shared Object       Symbol                            
  14.20%  parallel  libpthread-2.24.so  [.] pthread_mutex_lock            
  10.69%  sh        libc-2.24.so        [.] __strchr_sse2                 
   5.04%  sh        libc-2.24.so        [.] strlen                        
   4.61%  parallel  libpthread-2.24.so  [.] __pthread_mutex_unlock_usercnt
   4.13%  parallel  [kernel.vmlinux]    [k] entry_SYSCALL_64              
   3.60%  sh        ld-2.24.so          [.] strcmp                        
   3.49%  sh        ld-2.24.so          [.] do_lookup_x                   
   3.47%  parallel  [kernel.vmlinux]    [k] page_fault                    
   3.13%  parallel  parallel            [.] sdallocx                      
   2.74%  sh        libc-2.24.so        [.] malloc

I can also see your kernel updated, from 4.7.3-2-ARCH before to 4.7.5-1-ARCH now. (Hit i in perf report.) If there was a kernel problem, maybe they've fixed it...

mmstick · 2016-09-26T19:32:34Z

Indeed, it's still slow compared to Intel. This is an eight core FX @ 4GHz that's taking 36s to do what an i5 2410M @ 2.3 GHz does in 5s, and manages to run slower than GNU Parallel which does the same job in 30s. The wall time should be 3s to be comparable to the performance improvement that Intel sees (from 54s with GNU to 5s with Rust).

Shnatsel · 2016-09-26T21:42:59Z

Sadly I have no clue how to use perf (links to howtos would be appreciated), but sysprof shows a 70% cluster in copy_page in the kernel:

Here's the sysprof capture so you can inspect it yourself

I am on Linux 4.4.0-38-generic #57-Ubuntu

Shnatsel · 2016-09-26T21:53:58Z

For whatever reason perf archive command seems to be missing on Ubuntu despite being listed in perf --help; I get perf: 'archive' is not a perf-command. See 'perf --help'. error.

However, I could get perf report to work, and it points at copy_page as well: 81,58% parallel [kernel.kallsyms] [k] copy_page and everything else is under 1%

cuviper · 2016-09-26T22:09:57Z

Your sysprof shows that copy_page is occurring under do_huge_pmd_page, which is related to the kernel's transparent huge pages. You might cat /sys/kernel/mm/transparent_hugepage/enabled and compare that between your Intel and AMD machines. On Fedora 24 I see always [madvise] never, meaning it's currently only using huge pages if advised to. Try never and see what happens.

# echo never >/sys/kernel/mm/transparent_hugepage/enabled

(run that as root)

If that helps you, it doesn't necessarily mean that you need to leave that disabled all the time. Spawning so many processes in such a short time is a pretty unrealistic workload. Do some performance testing with more typical use cases before deciding how to tune your system.

We're pretty far afield from anything Rust can do about this, unless you find evidence of something else that Rust may be doing unoptimally. This doesn't look like a codegen issue to me.

Shnatsel · 2016-09-26T22:13:27Z

[always] madvise never for me by default.

# echo never >/sys/kernel/mm/transparent_hugepage/enabled helped, it is fast even on AMD now. Thanks!

mcpherrinm · 2016-09-26T22:47:38Z

To confirm that the CPU is a red herring: If you echo always >/sys/kernel/mm/transparent_hugepage/enabled on an Intel system, do you see an equivalent slow down?

Shnatsel · 2016-09-26T23:23:15Z

Tried it in "madvise" mode and it's still fast, so Rust is probably doing everything right here, i.e. not advising the kernel to use huge pages.

Aatch · 2016-09-26T23:28:45Z

Seems that the issue was unrelated to rustc. Closing.

ssokolow · 2016-09-27T05:28:33Z

I started looking into it and apparently this isn't the first case of a jemalloc-using application interacting poorly with transparent huge pages. Here's one from a little over a year ago:

https://www.digitalocean.com/company/blog/transparent-huge-pages-and-alternative-memory-allocators/

(In the linked case, it was leak-like behaviour because THP was forcing 2MiB pages on jemalloc and it was assuming that applying MADV_DONTNEED to ranges smaller than 2MiB was enough for them to be returned to the OS.)

mmstick · 2016-09-27T15:35:13Z

I compiled my kernel with support for transparent_hugepages and tried out always on my Intel laptop. Results were that my Intel processor became just as slow, so it's not specific to AMD. It would also appear that Solus OS already knows about issues with this being set to always, as Solus OS has been using madvise by default for a while now.

mmstick · 2016-09-27T15:39:52Z

@Shnatsel You can use perf by running sudo perf record -F 99 <COMMAND> followed by sudo perf archive. You'll get a perf.data file from record and symbols from archive. Then you can use perf script to view it.

Shnatsel · 2016-10-04T01:13:35Z

I can confirm that this is a bug with jemalloc.

Forcing the use of system default allocator as described in https://doc.rust-lang.org/nightly/book/custom-allocators.html gives good fork performance even with transparent huge pages enabled.

Too bad it is an unstable language feature. Is there any way to get jemalloc off my lawn without using unstable language features, like a compiler switch?

alexcrichton · 2016-10-04T03:46:10Z

cc @sfackler, another data point for jemalloc...

gnzlbg · 2016-10-04T11:00:37Z

@Shnatsel is there a tracking issue in jemallocs repo for this?

mmstick · 2016-10-04T14:25:12Z

I think you should report a bug against your Linux distribution that it should be using madvise, not always.

Shnatsel · 2016-10-04T22:04:24Z

@gnzlbg I have not opened one. I have no clue where the jemalloc bug tracker is

djc · 2016-10-05T06:28:48Z

@Shnatsel https://github.com/jemalloc/jemalloc/issues

mmstick mentioned this issue Sep 25, 2016

Performs slower than GNU parallel when transparent huge pages are enabled mmstick/parallel#22

Closed

bstrie added the I-slow Issue: Problems and improvements with respect to performance of generated code. label Sep 25, 2016

Aatch closed this as completed Sep 26, 2016

mmstick changed the title ~~Compiled binaries executing significantly slower on AMD hardware~~ [SOLVED] Compiled binaries executing significantly slower Sep 27, 2016

Shnatsel mentioned this issue Oct 4, 2016

Tracking issue for location of facade crates #27783

Closed

Shnatsel mentioned this issue Oct 4, 2016

Use system memory allocator in instrumented binaries rust-fuzz/afl.rs#97

Open

Shnatsel mentioned this issue Oct 16, 2016

Abysmal fork performance on Linux with transparent huge pages enabled jemalloc/jemalloc#473

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SOLVED] Compiled binaries executing significantly slower #36705

[SOLVED] Compiled binaries executing significantly slower #36705

mmstick commented Sep 25, 2016

Shnatsel commented Sep 25, 2016

Shnatsel commented Sep 25, 2016

mmstick commented Sep 25, 2016

gnzlbg commented Sep 25, 2016

mmstick commented Sep 25, 2016

mmstick commented Sep 25, 2016

mmstick commented Sep 25, 2016

mmstick commented Sep 25, 2016

cuviper commented Sep 26, 2016

mmstick commented Sep 26, 2016

cuviper commented Sep 26, 2016

mmstick commented Sep 26, 2016

mmstick commented Sep 26, 2016

cuviper commented Sep 26, 2016

mmstick commented Sep 26, 2016 •

edited

Loading

Shnatsel commented Sep 26, 2016

Shnatsel commented Sep 26, 2016

cuviper commented Sep 26, 2016

Shnatsel commented Sep 26, 2016

mcpherrinm commented Sep 26, 2016

Shnatsel commented Sep 26, 2016

Aatch commented Sep 26, 2016

ssokolow commented Sep 27, 2016

mmstick commented Sep 27, 2016

mmstick commented Sep 27, 2016

Shnatsel commented Oct 4, 2016

alexcrichton commented Oct 4, 2016

gnzlbg commented Oct 4, 2016

mmstick commented Oct 4, 2016

Shnatsel commented Oct 4, 2016

djc commented Oct 5, 2016

[SOLVED] Compiled binaries executing significantly slower #36705

[SOLVED] Compiled binaries executing significantly slower #36705

Comments

mmstick commented Sep 25, 2016

Shnatsel commented Sep 25, 2016

Shnatsel commented Sep 25, 2016

mmstick commented Sep 25, 2016

gnzlbg commented Sep 25, 2016

mmstick commented Sep 25, 2016

mmstick commented Sep 25, 2016

mmstick commented Sep 25, 2016

mmstick commented Sep 25, 2016

cuviper commented Sep 26, 2016

mmstick commented Sep 26, 2016

cuviper commented Sep 26, 2016

mmstick commented Sep 26, 2016

mmstick commented Sep 26, 2016

cuviper commented Sep 26, 2016

mmstick commented Sep 26, 2016 • edited Loading

Shnatsel commented Sep 26, 2016

Shnatsel commented Sep 26, 2016

cuviper commented Sep 26, 2016

Shnatsel commented Sep 26, 2016

mcpherrinm commented Sep 26, 2016

Shnatsel commented Sep 26, 2016

Aatch commented Sep 26, 2016

ssokolow commented Sep 27, 2016

mmstick commented Sep 27, 2016

mmstick commented Sep 27, 2016

Shnatsel commented Oct 4, 2016

alexcrichton commented Oct 4, 2016

gnzlbg commented Oct 4, 2016

mmstick commented Oct 4, 2016

Shnatsel commented Oct 4, 2016

djc commented Oct 5, 2016

mmstick commented Sep 26, 2016 •

edited

Loading