TechEmpower benchmark spends 7.7% of time in LockSupport.unpark()? #30231
-
@franz1981 might be interested.
-
Yep, thanks for having fun profiling :) Just a question: does JProfiler allow profiling a Docker application including native stack traces? I believe not, because one of the lesser-known issues (maybe unknown?) of LockSupport::unpark is contention on the Linux futex hash-bucket spin lock (see futex_wake for more info), and it should appear as a proper CPU cost with sampling profiling (not instrumented), given that spin locks under contention consume real CPU cycles!
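To make that concrete, here is a minimal, self-contained sketch (my own illustration, not taken from the benchmark; class name and sizing are made up) that hammers park/unpark handoffs across many thread pairs. Profiled with native stack traces, it should make futex_wake visible, and with enough pairs on enough cores, the bucket spin-lock contention described above:

```java
import java.util.concurrent.locks.LockSupport;

public class UnparkContention {
    public static void main(String[] args) throws InterruptedException {
        int pairs = Runtime.getRuntime().availableProcessors();
        for (int i = 0; i < pairs; i++) {
            Thread sleeper = new Thread(() -> {
                while (!Thread.currentThread().isInterrupted()) {
                    LockSupport.park(); // blocks until a waker unparks us
                }
            });
            sleeper.setDaemon(true);
            sleeper.start();
            Thread waker = new Thread(() -> {
                while (!Thread.currentThread().isInterrupted()) {
                    LockSupport.unpark(sleeper); // ends in a futex wake when the sleeper is parked
                }
            });
            waker.setDaemon(true);
            waker.start();
        }
        Thread.sleep(30_000); // attach a profiler with native stacks during this window
    }
}
```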
-
Hi @franz1981 , AFAIK async profiling is the only way of seeing native stack traces in the profiling results. JProfiler in theory supports that, but I didn't manage to get it to work. However, I ran async-profiler on the non-reactive test and got an HTML tree with native stack traces. In there,
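For reference, such a run can also be scripted through async-profiler's Java API (the one.profiler.AsyncProfiler class shipped in the async-profiler jar). This is a sketch from memory, so double-check the command strings against the project's README:

```java
import one.profiler.AsyncProfiler;

public class NativeStackProfiling {
    public static void main(String[] args) throws Exception {
        // Requires libasyncProfiler.so to be loadable by the JVM
        AsyncProfiler profiler = AsyncProfiler.getInstance();
        // Takes the same comma-separated commands as the agent/CLI;
        // with the cpu event the .html flame graph includes native frames
        profiler.execute("start,event=cpu,file=profile.html");
        runBenchmarkIteration(); // hypothetical stand-in for driving load
        profiler.execute("stop,file=profile.html");
    }

    private static void runBenchmarkIteration() {
        // placeholder for the actual workload against the app under test
    }
}
```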
-
Yep, but worth remembering that there's no free lunch here:
TL;DR: it should be addressed by users who know their system and what kind of tasks they usually throw at the blocking thread pool, and who can then decide whether to make the system more concurrent on the I/O side (reducing that wake-up cost, as said before) or more capable of handling concurrent blocking tasks instead. But usually you won't get both for free. The other option is to switch to full reactive and not worry about these complex behaviours anymore, thanks to using separately configured resources that cooperate while providing a service.
-
Hm, if we consider only the inner workings of …

Now I wonder which threads can compete for one such spin lock? Would that be only the producer threads? Telling from the native stack trace, and guessing a lot, it seems …

My C++ and C skills aren't all that great, so I'm not sure what futex it is that … Unfortunately, …

BTW, the history of threads looks like this, with the maximum total number of threads at a time being 133: …
-
The spin lock I mentioned is explained here: https://elixir.bootlin.com/linux/v6.1.3/source/kernel/futex/waitwake.c#L17, and each lock is shared through a hash pointing to a specific bucket. For the TechEmpower test, the tasks are both database ones and plain HTTP (we are talking about the non-reactive stack here), hence very different from each other; finding a single optimal tuning is possible but won't be worth the effort, given that in the near future we plan to remove this type of test from TechEmpower.
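A rough Java analogue of that scheme (illustration only, nothing like the real kernel code): addresses hash into a fixed table of buckets, each guarded by its own lock, so two unrelated futexes that collide on a bucket contend on the same lock.

```java
import java.util.concurrent.locks.ReentrantLock;

public class StripedBuckets {
    private final ReentrantLock[] buckets;

    StripedBuckets(int size) {
        buckets = new ReentrantLock[size];
        for (int i = 0; i < size; i++) buckets[i] = new ReentrantLock();
    }

    // Same idea as the kernel's futex hash: address -> bucket index
    ReentrantLock bucketFor(long address) {
        int h = Long.hashCode(address);
        return buckets[(h & 0x7fffffff) % buckets.length];
    }

    void wake(long address) {
        ReentrantLock lock = bucketFor(address); // colliding addresses share this lock
        lock.lock();
        try {
            // walk the bucket's wait queue and wake matching waiters (elided)
        } finally {
            lock.unlock();
        }
    }
}
```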
-
I must admit I haven't yet tried to understand in depth what that C code does. But I find it hard to believe that there should be a single wrongly-sized hash map baked into a central code path of the Linux kernel that leads to CPU cycles being burned on hash collisions when the number of threads is only <= 133? That's the maximum number of threads I saw, and that is with 4 CPU cores. How much worse would this be with 16 cores, when thread pools are sized with the number of CPU cores as a factor? If that were really true, wouldn't it be easy for Red Hat to fix this by enlarging the hash map in the kernel? Think of all the CO2 this could help save ;) Do you perchance know the code line where the size of that hash map is determined?
-
It's not about believing, the code says it :P (and a friend of mine working as a performance engineer on perf tools at SUSE confirmed this nasty behaviour years ago).
I believe that value was obtained after years of tests, including with containerised applications: users who spot it as a major bottleneck can still override it. But I understand your point.
It should be here: …
-
I'm afraid this isn't hash collisions, but actual spin waiting, i.e. busy waiting (but then I might have got something wrong here anyway). Assuming that …

What I can find in the code is __raw_spin_unlock_irqrestore (with two underscores?), but that doesn't call …
-
The spinning happens because of the hash collision: if two different threads share the same bucket, they will end up competing over the same spin lock to enter the mutually exclusive region.
I don't have any ARM machine at hand, but it's just the entry of the syscall I posted a few comments above, i.e. futex_wake. Anyway, in your case the cost is not in the hash-bucket spin lock but in the per-process spin lock (there's no strong distinction between threads and processes from this point of view) -> https://elixir.bootlin.com/linux/v6.1.3/source/kernel/sched/core.c#L4022. See there for the explanation of what it is.
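In code, that case looks less like the pairwise handoffs sketched earlier and more like many wakers converging on one target thread, so every wake-up serializes on that one task's lock in the kernel. A hypothetical repro (my own sketch, not from the thread):

```java
import java.util.concurrent.locks.LockSupport;

public class SingleTargetWakeups {
    public static void main(String[] args) throws InterruptedException {
        Thread target = new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                LockSupport.park();
            }
        });
        target.setDaemon(true);
        target.start();

        int wakers = Runtime.getRuntime().availableProcessors();
        for (int i = 0; i < wakers; i++) {
            Thread w = new Thread(() -> {
                while (!Thread.currentThread().isInterrupted()) {
                    LockSupport.unpark(target); // all wakers hit the same task
                }
            });
            w.setDaemon(true);
            w.start();
        }
        Thread.sleep(30_000); // profile this window with native stacks
    }
}
```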
-
Telling from the native stack trace, I don't think the time is spent busy waiting. If you look at the bottom of the native stack trace, you can see that 6.97% of the time is spent as self time within … If you look at its source code, at least on ARM you end up in this inlined code of …

My new friend ChatGPT tells me this about …

Does async-profiler report something wrong here, or could there really be a problem with this, at least on ARM? BTW, I found the corresponding …
-
I asked the folks over at async-profiler for more insights here.
-
@franz1981 if I fork the benchmark and try to make it easy to reproduce, would you be willing to try it on an Intel machine?
-
Hi @franz1981 , … Otherwise, it might be either due to Linux on ARM or Docker on Mac, which AFAIK runs in a VM. WDYT?
-
Hi @franz1981 , on Linux I was able to do async profiling using JProfiler, and I'd like to share what it sees as the top ten code hotspots: … I'll open new threads here for the two that seem most interesting to me.
-
Hi,

out of curiosity, I profiled the Quarkus TechEmpower benchmark with JProfiler 11.1.4. The biggest "hotspot" that it reports is in `LockSupport.unpark()`, where allegedly 7.7% of all CPU time is spent: …

(I did use instrumentation for profiling, i.e. not sampling, where JVM checkpoints might be interfering. Instrumentation here should not distort the results through measuring overhead, as that happens only for methods with very short execution times.)

The average execution time of `LockSupport.unpark()` seems very high, at 629 microseconds with comparatively few invocations (660K). For comparison, some String escaping method takes 21 microseconds on average with > 1 million invocations during the same run: …

On googling I only found this discussion, which suggests that time spent in `LockSupport.unpark()` could be a typical problem of "reactive streams", with Akka as the example. Please note that the profiling method used there is async profiling, which is not supposed to suffer from the safepoint bias problem. It might be that profiling using sampling with JVM checkpoints doesn't reveal this potential performance problem, while instrumentation does.

Now, as you guys blew my mind with Quarkus Insights #107: Quarkus Runtime performance - a peek into JVM internals and its analysis of the JVM's performance of `instanceof`, I'm sure you will know whether there really could be a problem with the JVM's performance of `LockSupport.unpark()`, or maybe there is a reason why it could block for some time due to some kind of lock contention? Or, of course, there could be some problem with my profiling approach...

Thanks for any answers!
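A back-of-the-envelope harness for the caller-side cost (my own sketch, not the JProfiler setup; a real measurement would use JMH and coordinate each park/unpark handoff, since unparking an already-running thread takes the cheap path):

```java
import java.util.concurrent.locks.LockSupport;

public class UnparkCost {
    public static void main(String[] args) throws InterruptedException {
        Thread sleeper = new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                LockSupport.park();
            }
        });
        sleeper.setDaemon(true);
        sleeper.start();
        Thread.sleep(100); // let the sleeper reach park()

        int rounds = 1_000_000;
        long start = System.nanoTime();
        for (int i = 0; i < rounds; i++) {
            // after the first wake-up, most of these hit a running thread,
            // so they mostly measure the cheap (no futex wake) path
            LockSupport.unpark(sleeper);
        }
        long elapsed = System.nanoTime() - start;
        System.out.printf("avg unpark(): %d ns%n", elapsed / rounds);
    }
}
```

If the cheap path dominates, this prints a few tens of nanoseconds per call, several orders of magnitude below the 629 microseconds reported above, which is what makes a contended wake-up path the interesting suspect.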