Very slow thread shrinking after rapid growth #4585
The code is doing the slow shrinking by design: the idea is that if you had a load spike that caused the creation of lots of threads, you don't want to shrink too aggressively because you can have another load spike. Please make sure you are on the latest version of Jetty, since we have fixed a couple of issues related to the thread pool and shrinking recently.
For that configuration we have the idleTimeout parameter. If we know that the spike will occur within N seconds since the last spike, then we'll probably just increase the idleTimeout value.
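(For reference, a minimal sketch of how that idleTimeout can be set on an embedded Jetty 9.4 server; the specific values here are illustrative only:)

```java
import org.eclipse.jetty.server.Server;
import org.eclipse.jetty.util.thread.QueuedThreadPool;

public class ThreadPoolIdleTimeoutExample
{
    public static void main(String[] args) throws Exception
    {
        // Constructor arguments: maxThreads, minThreads, idleTimeout (milliseconds).
        // A larger idleTimeout keeps spare threads around between load spikes;
        // a smaller one lets the pool shrink back down sooner.
        QueuedThreadPool threadPool = new QueuedThreadPool(400, 10, 30_000);
        threadPool.setName("server");

        Server server = new Server(threadPool);
        // ... add connectors and handlers as usual ...
        server.start();
        server.join();
    }
}
```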
@gregw thoughts?
Perhaps we can introduce
Firstly, what version is your "latest"? We have had some recent releases that did create spikes in the number of threads due to a race in the new-thread logic that created more than were needed.

I am firmly of the belief that we do not wish to let all threads die the moment they are idle. Slow shrinking has been a worthwhile feature for a long time and helps to smooth the churn in thread count with uneven loads. If you have …

Note that our usage of the reserved thread pool does give us good MRU behaviour, as we prefer to use reserved threads and they are the MRU. So we tend to favour MRU threads; if we didn't, then a small load would prevent shrinking, which we do not see.

The cost of an idle thread is small, probably less than the cost of complex shrinking logic. I'd much rather spend effort avoiding creating unnecessary threads in the first place. So let's make sure you are running a version that doesn't have the create-thread race.
Thank you for your response.
Currently we're using the following version: 9.4.26.v20200117.
That is indeed the latest. 9.4.27 is just about to be built, but it only has a small tweak to the reserved thread pool, so you don't have any known reason for creating too many threads for a spike. Do you believe the jump from 100 to 350 threads was reasonable? Could your load have had several hundred simultaneous requests in a burst? Are you satisfied that setting a low idleTimeout on the thread pool can give you aggressive shrinking if you want it (e.g. you could set 1s and shrink from 350 in about 5 minutes)?
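(For scale, given that the pool shrinks at most one idle thread per idleTimeout as described later in the issue: dropping from 350 back to 100 threads means 250 thread exits, so an idleTimeout of 1s shrinks the pool in roughly 250 seconds, about 4 to 5 minutes, while Jetty's default 60s idleTimeout would take on the order of 4 hours.)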
We cannot currently say that the problem is Jetty itself. We noticed that our pretty rare JDK had SSL cache issues described here: The overall problem is that suddenly one thread acquires a lock on an SSL cache and forces all other threads to wait for about 2.5 minutes while the cache gets cleaned. Here's the most interesting stack trace of the problem thread from the JVM:
Currently we're observing the system for any other extra spikes and cannot yet say that the problem has completely gone.
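(As an aside, not from the original exchange: besides updating the JDK, one commonly suggested mitigation for SSL session cache cleanup stalls is to bound the session cache instead of leaving it unlimited. A minimal sketch using the standard JSSE APIs, with purely illustrative values:)

```java
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLSessionContext;

public class SslSessionCacheLimitExample
{
    public static void main(String[] args) throws Exception
    {
        // JVM-wide default; can also be passed as -Djavax.net.ssl.sessionCacheSize=10000.
        System.setProperty("javax.net.ssl.sessionCacheSize", "10000");

        // Per-context limit on the server-side session cache.
        SSLContext sslContext = SSLContext.getDefault();
        SSLSessionContext serverSessions = sslContext.getServerSessionContext();
        serverSessions.setSessionCacheSize(10_000); // cap the number of cached sessions
        serverSessions.setSessionTimeout(300);      // seconds before cached sessions expire
    }
}
```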
The fix for this issue has been backported to OpenJDK 8, so you want to use a recent version of OpenJDK 8 to avoid the JVM lockup. A JVM lockup like what you describe would definitely cause a spike in the number of threads when the lockup resolves. What do you mean by "pretty rare JDK"?
I mean we've been using Oracle JDK 1.8 below version 1.8.211.
@sadko4u any news about the behavior of your system?
After the Java update we haven't noticed rapid thread growth on our system.
@gregw I think we should discuss this further. All the GCs in modern versions of Java can now give uncommitted memory back to the operating system, so in a cloud environment you can pay less if the server is idle. Note also that a reserved thread that expires goes back to the thread pool, where it is subject (again) to the idle timeout: in the best case a reserved thread expires in twice the idle timeout (thread pool idle timeout + reserved thread executor idle timeout).
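(To put illustrative numbers on that: assuming, purely for example, a 60s thread pool idleTimeout and a 60s reserved thread executor idle timeout, a reserved thread that stops being used can sit for up to about a minute before it is released back to the pool, and then up to another minute before the pool lets it exit, so roughly two minutes end to end.)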
@sbordet The thread pool already has its own idleTimeout, which can be configured to an aggressively small number if desired. There is very little difference between shrinking 1 thread every 600ms, 10 threads every 6000ms, or 100 threads every 60000ms. The only tweaks that I can think of are:
But I'm not really feeling that these are high-priority changes. We would probably be better off doing a full audit of our memory footprint to reduce memory per thread rather than reduce the number of threads.
Closing then. |
Thanks for sharing this. @gregw
Probably we want to consider the ThreadLocal usage of applications. That's where this LRU policy causes instability, as applications just want to release resources after the spike.
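(A hypothetical illustration of that concern, not from the thread: a per-thread buffer held in a ThreadLocal stays referenced for as long as the idle worker thread stays alive, so memory allocated during a spike is only reclaimed once the pool actually shrinks.)

```java
public class ThreadLocalBufferExample
{
    // Hypothetical application-level cache: each worker thread lazily allocates
    // a 1 MiB scratch buffer the first time it handles a request.
    private static final ThreadLocal<byte[]> SCRATCH =
            ThreadLocal.withInitial(() -> new byte[1024 * 1024]);

    public static void handleRequest()
    {
        byte[] buffer = SCRATCH.get();
        // ... use the buffer while processing the request ...
        // The buffer is NOT freed when the request ends; it is only freed when the
        // owning thread dies, which with slow shrinking can be hours after the spike.
    }
}
```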
Hello!
We're currently exploring the problem of spontaneous and rapid growth of the thread pool after switching from an old version of Jetty to the latest one. Currently there is no reason to say that the cause is Jetty internals, but we've encountered very strange behaviour in thread shrinking.
Consider looking at this monitoring graph:
It shows the total number of launched threads in an Oracle JVM 1.8.x. You can see a rapid growth here, a period when the Prometheus system does not respond to statistics requests because all Jetty threads are busy, and then a very slow thread-shrinking curve.
I believe this code is incorrect and allows only one thread to be shrunk within each idleTimeout period:
https://github.com/eclipse/jetty.project/blob/9d589f1b6e2b8cc09d5bf53ab00a68a74d7b7b3a/jetty-util/src/main/java/org/eclipse/jetty/util/thread/QueuedThreadPool.java#L919-L929
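(Roughly, and paraphrased rather than quoted from the Jetty source: the check is gated on a single timestamp shared by all pool threads, so only the one idle thread that wins the compare-and-set may exit per idleTimeout window. A simplified sketch of the idea:)

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Simplified sketch of the shrink gate, not the actual Jetty code.
class ShrinkGateSketch
{
    // One timestamp shared by every thread in the pool.
    private final AtomicLong lastShrink = new AtomicLong(System.nanoTime());

    // Called by an idle thread to ask whether it may exit.
    boolean idleThreadMayExit(long idleTimeoutMs)
    {
        long last = lastShrink.get();
        long now = System.nanoTime();
        // Only the thread that wins the compareAndSet exits; everyone else keeps waiting,
        // so the pool loses at most one thread per idleTimeout.
        return now - last > TimeUnit.MILLISECONDS.toNanos(idleTimeoutMs)
                && lastShrink.compareAndSet(last, now);
    }
}
```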
The expected behaviour should look like one of the LRU/LFU algorithm implementations: if a thread does not perform any useful work within idleTimeout, it should be shrunk.