TechEmpower benchmark spends 7.7% of time in LockSupport.unpark()? #30231
-
@franz1981 might be interested.
-
Yep, thanks for having fun profiling :) Just a question: does JProfiler allow profiling a Docker application including native stack traces? I believe not, because one of the lesser-known issues (maybe unknown?) of LockSupport::unpark is contention on the Linux futex hash-bucket spin lock (see futex_wake for more info), and it should appear as a proper CPU cost with sampling profiling (not instrumented), given that spin locks under contention consume real CPU cycles!
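To make that concrete, here is a minimal, self-contained sketch (my own illustration, not taken from the benchmark; class name and sizing are made up) that hammers park/unpark handoffs across many thread pairs. Profiled with native stack traces, it should make futex_wake visible, and with enough pairs on enough cores, the bucket spin-lock contention described above:

```java
import java.util.concurrent.locks.LockSupport;

public class UnparkContention {
    public static void main(String[] args) throws InterruptedException {
        int pairs = Runtime.getRuntime().availableProcessors();
        for (int i = 0; i < pairs; i++) {
            Thread sleeper = new Thread(() -> {
                while (!Thread.currentThread().isInterrupted()) {
                    LockSupport.park(); // blocks until a waker unparks us
                }
            });
            sleeper.setDaemon(true);
            sleeper.start();
            Thread waker = new Thread(() -> {
                while (!Thread.currentThread().isInterrupted()) {
                    LockSupport.unpark(sleeper); // ends in a futex wake when the sleeper is parked
                }
            });
            waker.setDaemon(true);
            waker.start();
        }
        Thread.sleep(30_000); // attach a profiler with native stacks during this window
    }
}
```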
-
Hi @franz1981 , AFAIK async profiling is the only way of seeing native stack traces in the profiling results. JProfiler in theory supports that, but I didn't manage to get it to work. However, I ran async-profiler on the non-reactive test and got an HTML tree with native stack traces. In there,
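For reference, such a run can also be scripted through async-profiler's Java API (the one.profiler.AsyncProfiler class shipped in the async-profiler jar). This is a sketch from memory, so double-check the command strings against the project's README:

```java
import one.profiler.AsyncProfiler;

public class NativeStackProfiling {
    public static void main(String[] args) throws Exception {
        // Requires libasyncProfiler.so to be loadable by the JVM
        AsyncProfiler profiler = AsyncProfiler.getInstance();
        // Takes the same comma-separated commands as the agent/CLI;
        // with the cpu event the .html flame graph includes native frames
        profiler.execute("start,event=cpu,file=profile.html");
        runBenchmarkIteration(); // hypothetical stand-in for driving load
        profiler.execute("stop,file=profile.html");
    }

    private static void runBenchmarkIteration() {
        // placeholder for the actual workload against the app under test
    }
}
```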
-
Yep, but worth remembering that there's no free lunch here:
TL;DR: it should be addressed by users who know their system and what kind of tasks they usually throw at the blocking thread pool, and who can then decide whether to make the system more concurrent on the I/O side (reducing that wake-up cost, as said before) or more capable of handling concurrent blocking tasks instead. But usually you won't get both for free. The other option is to switch to full reactive and not worry about these complex behaviours anymore, thanks to using separately configured resources that cooperate while providing a service.
-
Hm, if we consider only the inner workings of …

Now I wonder which threads can compete for one such spin lock? Would that be only the producer threads? Telling from the native stack trace, and guessing a lot, it seems …

My C++ and C skills aren't all that great, so I'm not sure what futex it is that … Unfortunately, …

BTW, the history of threads looks like this, with the maximum total number of threads at a time being 133: …
-
The spin lock I mentioned is explained here: https://elixir.bootlin.com/linux/v6.1.3/source/kernel/futex/waitwake.c#L17, and each lock is shared through a hash pointing to a specific bucket. For the TechEmpower test, the tasks are both database ones and plain HTTP (we are talking about the non-reactive stack here), hence very different from each other; finding a single optimal tuning is possible but won't be worth the effort, given that in the near future we plan to remove this type of test from TechEmpower.
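A rough Java analogue of that scheme (illustration only, nothing like the real kernel code): addresses hash into a fixed table of buckets, each guarded by its own lock, so two unrelated futexes that collide on a bucket contend on the same lock.

```java
import java.util.concurrent.locks.ReentrantLock;

public class StripedBuckets {
    private final ReentrantLock[] buckets;

    StripedBuckets(int size) {
        buckets = new ReentrantLock[size];
        for (int i = 0; i < size; i++) buckets[i] = new ReentrantLock();
    }

    // Same idea as the kernel's futex hash: address -> bucket index
    ReentrantLock bucketFor(long address) {
        int h = Long.hashCode(address);
        return buckets[(h & 0x7fffffff) % buckets.length];
    }

    void wake(long address) {
        ReentrantLock lock = bucketFor(address); // colliding addresses share this lock
        lock.lock();
        try {
            // walk the bucket's wait queue and wake matching waiters (elided)
        } finally {
            lock.unlock();
        }
    }
}
```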
-
I must admit I haven't yet tried to understand in depth what that C code does. But I find it hard to believe that there should be a single wrongly-sized hash map baked into a central code path of the Linux kernel that leads to CPU cycles being burned on hash collisions when the number of threads is only <= 133? That's the maximum number of threads I saw, and that is with 4 CPU cores. How much worse would this be with 16 cores, when thread pools are sized with the number of CPU cores as a factor? If that were really true, wouldn't it be easy for Red Hat to fix this by enlarging the hash map in the kernel? Think of all the CO2 this could help save ;) Do you perchance know the code line where the size of that hash map is determined?
-
It's not about believing, the code says it :P (and a friend of mine working as a performance engineer on perf tools at SUSE confirmed this nasty behaviour years ago).
I believe that value was obtained after years of tests, including with containerised applications: users who spot it as a major bottleneck can still override it. But I understand your point.
It should be here: …
-
I'm afraid this isn't hash collisions, but actual spin waiting, i.e. busy waiting (but then I might have got something wrong here anyway). Assuming that …

What I can find in the code is __raw_spin_unlock_irqrestore (with two underscores?), but that doesn't call …
-
The spinning happens because of the hash collision: if two different threads share the same bucket, they will end up competing over the same spin lock to enter the mutually exclusive region.
I don't have any ARM machine at hand, but it's just the entry of the syscall I posted a few comments above, i.e. futex_wake. Anyway, in your case the cost is not in the hash-bucket spin lock but in the per-process spin lock (there's no strong distinction between threads and processes from this point of view) -> https://elixir.bootlin.com/linux/v6.1.3/source/kernel/sched/core.c#L4022. See there for the explanation of what it is.
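In code, that case looks less like the pairwise handoffs sketched earlier and more like many wakers converging on one target thread, so every wake-up serializes on that one task's lock in the kernel. A hypothetical repro (my own sketch, not from the thread):

```java
import java.util.concurrent.locks.LockSupport;

public class SingleTargetWakeups {
    public static void main(String[] args) throws InterruptedException {
        Thread target = new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                LockSupport.park();
            }
        });
        target.setDaemon(true);
        target.start();

        int wakers = Runtime.getRuntime().availableProcessors();
        for (int i = 0; i < wakers; i++) {
            Thread w = new Thread(() -> {
                while (!Thread.currentThread().isInterrupted()) {
                    LockSupport.unpark(target); // all wakers hit the same task
                }
            });
            w.setDaemon(true);
            w.start();
        }
        Thread.sleep(30_000); // profile this window with native stacks
    }
}
```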
-
Telling from the native stack trace, I don't think the time is spent busy waiting. If you look at the bottom of the native stack trace, you can see that 6.97% of the time is spent as self time within … If you look at its source code, at least on ARM you end up in this inlined code of …

My new friend ChatGPT tells me this about …

Does async-profiler report something wrong here, or could there really be a problem with this, at least on ARM? BTW, I found the corresponding …
-
I asked the folks over at async-profiler for more insights here.
-
@franz1981 if I fork the benchmark and try to make it easy to reproduce, would you be willing to try it on an Intel machine?
-
Hi @franz1981 , … Otherwise, it might be either due to Linux on ARM or Docker on Mac, which AFAIK runs in a VM. WDYT?
-
Hi @franz1981 , on Linux I was able to do async profiling using JProfiler, and I'd like to share what it sees as the top ten code hotspots: … I'll open new threads here for the two that seem most interesting to me.
-
Hi,

out of curiosity, I profiled the Quarkus TechEmpower benchmark with JProfiler 11.1.4. The biggest "hotspot" that it reports is in `LockSupport.unpark()`, where allegedly 7.7% of all CPU time is spent: …

(I did use instrumentation for profiling, i.e. not sampling, where JVM checkpoints might be interfering. Instrumentation here should not distort the results through measuring overhead, as that happens only for methods with very short execution times.)

The average execution time of `LockSupport.unpark()` seems very high, at 629 microseconds with comparatively few invocations (660K). For comparison, some String escaping method takes 21 microseconds on average with > 1 million invocations during the same run: …

On googling I only found this discussion, which suggests that time spent in `LockSupport.unpark()` could be a typical problem of "reactive streams", with Akka as the example. Please note that the profiling method used there is async profiling, which is not supposed to suffer from the safepoint bias problem. It might be that profiling using sampling with JVM checkpoints doesn't reveal this potential performance problem, while instrumentation does.

Now, as you guys blew my mind with Quarkus Insights #107: Quarkus Runtime performance - a peek into JVM internals and its analysis of the JVM's performance of `instanceof`, I'm sure you will know whether there really could be a problem with the JVM's performance of `LockSupport.unpark()`, or maybe there is a reason why it could block for some time due to some kind of lock contention? Or, of course, there could be some problem with my profiling approach...

Thanks for any answers!
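A back-of-the-envelope harness for the caller-side cost (my own sketch, not the JProfiler setup; a real measurement would use JMH and coordinate each park/unpark handoff, since unparking an already-running thread takes the cheap path):

```java
import java.util.concurrent.locks.LockSupport;

public class UnparkCost {
    public static void main(String[] args) throws InterruptedException {
        Thread sleeper = new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                LockSupport.park();
            }
        });
        sleeper.setDaemon(true);
        sleeper.start();
        Thread.sleep(100); // let the sleeper reach park()

        int rounds = 1_000_000;
        long start = System.nanoTime();
        for (int i = 0; i < rounds; i++) {
            // after the first wake-up, most of these hit a running thread,
            // so they mostly measure the cheap (no futex wake) path
            LockSupport.unpark(sleeper);
        }
        long elapsed = System.nanoTime() - start;
        System.out.printf("avg unpark(): %d ns%n", elapsed / rounds);
    }
}
```

If the cheap path dominates, this prints a few tens of nanoseconds per call, several orders of magnitude below the 629 microseconds reported above, which is what makes a contended wake-up path the interesting suspect.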