[SOLVED] Compiled binaries executing significantly slower #36705
Comments
I can confirm this. A Sysprof recording showing where the time is spent is attached at mmstick/parallel#22.
I can also confirm this on my hardware.
sysprofs-intel.zip: another sysprof recording, but from my Intel laptop (ignore Chrome streaming in the background). There's certainly something wonky happening with AMD hardware, looking at the sysprof provided by @Shnatsel.
I would need to know:
Here's the result:
Here are results from my AMD FX-8120 desktop:
Compilation is simply being done with cargo.
And these are measurements taken using just four threads instead of eight on the FX-8120.
You should also get symbols into the capture. Anyway, even without symbols I can see that your AMD machine is mostly busy within just one particular function in the kernel:
Whereas the Intel machine is much more spread out:
But even the full Intel report doesn't show any samples in that function. Perhaps AMD is hitting a pathological case with something like page table setup, for instance. But we shouldn't just guess -- profile it again with symbols and see what that function actually is.
Does that mean it's an issue with the Linux kernel being poorly optimized for AMD processors in that specific function?
Possibly, but let the data guide you. See what that symbol is first; then you can drill into it from there. Have you tried the same benchmark with GNU parallel? I'll bet it will face the same problem, and if not you can try to figure out what it's doing differently.
Using the same benchmark, GNU Parallel takes 30s compared to the 36s spent by the Rust implementation on my AMD FX-8120 using four cores. With my i5 2410M laptop, GNU Parallel takes 54 seconds to do what the Rust implementation does in 5 seconds.
Not exactly sure what I'm looking at, but I have new perf data with the archive data alongside it.
Hmm, that new profile doesn't have the same 82% cluster. Is it still slow compared to Intel?
I can also see your kernel updated, from 4.7.3-2-ARCH before to 4.7.5-1-ARCH now.
Indeed, it's still slow compared to Intel. This is an eight-core FX @ 4GHz taking 36s to do what an i5 2410M @ 2.3GHz does in 5s, and it manages to run slower than GNU Parallel, which does the same job in 30s. The wall time would need to be around 3s to match the improvement Intel sees (from 54s with GNU to 5s with Rust).
Sadly I have no clue how to use perf. Here's the sysprof capture so you can inspect it yourself. I am on Arch Linux.
For whatever reason I couldn't get that to work. However, I could get a sysprof recording.
Your sysprof shows the time going to the kernel's huge page handling. Try switching the transparent_hugepage policy from always to madvise.
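A minimal sketch of the sysfs toggle, assuming the standard /sys/kernel/mm interface:

```sh
# Show the current policy (the bracketed value is the active one)
cat /sys/kernel/mm/transparent_hugepage/enabled

# Switch from "always" to "madvise"; takes effect immediately and reverts on reboot
echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
```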
(run that as root) If that helps, it doesn't necessarily mean that you need to leave it disabled all the time. Spawning so many processes in such a short time is a pretty unrealistic workload, so do some performance testing with more typical use cases before deciding how to tune your system. We're pretty far afield from anything Rust can do about this, unless you find evidence of something else that Rust may be doing suboptimally. This doesn't look like a codegen issue to me.
To confirm that the CPU is a red herring: if you run the benchmark with transparent hugepages set to madvise and it's still fast, then the program isn't explicitly requesting huge pages, and the always policy is what was hurting.
Tried it in "madvise" mode and it's still fast, so Rust is probably doing everything right here, i.e. not advising the kernel to use huge pages.
Seems that the issue was unrelated to rustc. Closing.
I started looking into it, and apparently this isn't the first case of a jemalloc-using application interacting poorly with transparent huge pages. Here's one from a little over a year ago: https://www.digitalocean.com/company/blog/transparent-huge-pages-and-alternative-memory-allocators/ (In the linked case, it was leak-like behaviour because THP was forcing 2MiB pages on jemalloc while it assumed that applying madvise(MADV_DONTNEED) to smaller ranges would still release the memory.)
I compiled my kernel with support for transparent_hugepages and tried it out.
@Shnatsel You can use perf by recording the run and then viewing the report; a sketch of the workflow is below.
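A minimal sketch of the usual record/report workflow; ./run-parallel-benchmark.sh is a stand-in for whatever command you're measuring:

```sh
# Record a profile with call graphs while the workload runs
perf record -g -- ./run-parallel-benchmark.sh

# Browse the samples interactively; symbols appear if debug info is available
perf report

# Optionally bundle the symbols so the data can be inspected on another machine
perf archive perf.data
```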
I can confirm that this is a bug with jemalloc. Forcing the use of the system default allocator as described in https://doc.rust-lang.org/nightly/book/custom-allocators.html gives good fork performance even with transparent huge pages enabled. Too bad it is an unstable language feature. Is there any way to get jemalloc off my lawn without using unstable language features, like a compiler switch?
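For reference, a minimal sketch of what that chapter described at the time; it requires a nightly compiler since the feature was unstable:

```rust
// Opt out of jemalloc and link the platform's system allocator instead
// (nightly-only feature at the time of this issue).
#![feature(alloc_system)]
extern crate alloc_system;

fn main() {
    // Heap allocations now go through the system malloc/free.
    let v: Vec<u8> = vec![0u8; 1024];
    println!("allocated {} bytes via the system allocator", v.len());
}
```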
cc @sfackler, another data point for jemalloc...
@Shnatsel is there a tracking issue in jemalloc's repo for this?
I think you should report a bug against your Linux distribution: it should be defaulting transparent hugepages to madvise, not always.
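In the meantime, one way to make madvise stick across reboots (assuming a GRUB-based setup) is the transparent_hugepage kernel parameter:

```sh
# /etc/default/grub: append the parameter to the existing kernel command line
GRUB_CMDLINE_LINUX_DEFAULT="... transparent_hugepage=madvise"

# Regenerate the bootloader config (Arch shown; Debian/Ubuntu use update-grub instead)
grub-mkconfig -o /boot/grub/grub.cfg
```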
@gnzlbg I have not opened one. I have no clue where the jemalloc bug tracker is.
As I've developed this Parallel application, I've noticed that everyone who uses it with an AMD processor reports significantly slower runtime performance than those using Intel processors, as in my personal benchmark listed within. The performance difference is so drastic that my AMD FX-8120 is orders of magnitude slower than my mobile Intel i5, to the point that the original Perl implementation is much faster than the Rust implementation on AMD hardware, even though on Intel hardware the Perl implementation is 40x slower than the Rust one.