logging hot threads on large queue of the management threadpool #140251

Merged
elasticsearchmachine merged 19 commits into elastic:main from ywangd:ES-13904-hot-threads-management-thread-pool on Jan 9, 2026

Conversation

@ywangd
Member

@ywangd ywangd commented Jan 7, 2026

We have a requirement to tell why the management threadpool gets piled up for an extended period of time. The idea is to log hot threads when that happens. Today this is done manually, but the queue has often drained by the time a human operator notices the alert and takes action. This PR explores the option of automating the capture within Elasticsearch.

The changes add new logic to EsThreadPoolExecutor#execute so that it checks the queue size and tracks how long it stays above a threshold. It eventually logs hot threads if the queue does not drain within the configured time (see the sketch after the list below). The approach has a few limitations:

  1. ThreadPool and related classes cannot register dynamic cluster settings, so the new settings are static, i.e. they require a rolling restart to take effect.
  2. It requires a new task to be submitted to trigger the potential hot-threads logging. So it is possible that the large queue has persisted for longer than the configured time, i.e. the logging may not always be on time. But this should not be much of an issue, since it is unlikely for the queue to stop moving for an extended time at exactly the threshold.
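
For illustration only, here is a rough, self-contained sketch of the tracking idea. The class, field and setting names below are made up for the example and are not the actual names added by this PR; the real change is wired into EsThreadPoolExecutor rather than a standalone class.

    // Rough sketch only: names and structure are illustrative, not the PR's actual code.
    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.ThreadPoolExecutor;
    import java.util.concurrent.TimeUnit;

    class QueueTrackingExecutor extends ThreadPoolExecutor {
        private static final long NOT_TRACKED = -1L;
        private final int largeQueueThreshold;   // hypothetical setting: queue size considered "large"
        private final long sustainedMillis;      // hypothetical setting: how long the queue must stay large
        // racy updates are tolerable here; the check is deliberately coarse-grained time-wise
        private volatile long largeQueueStartMillis = NOT_TRACKED;

        QueueTrackingExecutor(int threads, int largeQueueThreshold, long sustainedMillis) {
            super(threads, threads, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
            this.largeQueueThreshold = largeQueueThreshold;
            this.sustainedMillis = sustainedMillis;
        }

        @Override
        public void execute(Runnable command) {
            checkQueueSize(System.currentTimeMillis());
            super.execute(command);
        }

        private void checkQueueSize(long now) {
            if (getQueue().size() < largeQueueThreshold) {
                largeQueueStartMillis = NOT_TRACKED;     // queue drained, stop tracking
            } else if (largeQueueStartMillis == NOT_TRACKED) {
                largeQueueStartMillis = now;             // queue just became large, start the clock
            } else if (now - largeQueueStartMillis >= sustainedMillis) {
                // the queue has stayed large for the configured time; the real change calls
                // HotThreads.logLocalHotThreads here, rate-limited as discussed in the review below
                System.out.println("queue large for " + (now - largeQueueStartMillis) + "ms, capturing hot threads");
            }
        }
    }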

A big question I have is whether this could be automated outside Elasticsearch. Today the alert requires human interaction, which can be slow (and thus too late to capture the right moment). But is it possible to add an automated alert response that calls the hot-threads API against the relevant cluster? That would have a much better chance of getting the desired output and, if possible, would save the code change here and its ongoing maintenance.

The internal approach in this PR does have its own advantages: (1) it is guaranteed to capture the necessary output, and (2) the hot-threads output can be more informative since it includes the stack trace of the thread that is submitting the task. This gives a better hint of what is hitting the management thread pool, because the tasks already running on the pool may come from different sources and only the tasks being queued are relevant. Overall, hot threads is a close approximation for figuring out "why the queue got piled up".

Resolves: ES-13904

elasticsearchmachine and others added 9 commits January 7, 2026 09:27
@ywangd
Member Author

ywangd commented Jan 8, 2026

Per discussion on ES-13904, my current conclusion is that we want to continue with this PR as well as adding support for possible external automation. The former satisfies the immediate need for the investigation. The latter requires additional work and could be useful for other future investigations.

@ywangd ywangd marked this pull request as ready for review January 8, 2026 03:44
@ywangd ywangd requested a review from a team as a code owner January 8, 2026 03:44
@ywangd ywangd requested review from nicktindall and removed request for a team January 8, 2026 03:45
@elasticsearchmachine elasticsearchmachine added the Team:Core/Infra and Team:Distributed Coordination (obsolete) labels Jan 8, 2026
@elasticsearchmachine
Collaborator

Pinging @elastic/es-core-infra (Team:Core/Infra)

@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

Contributor

@nicktindall nicktindall left a comment


If we decide to do it this way, it seems like a good implementation

this.contextHolder = contextHolder;
this.hotThreadsOnLargeQueueConfig = hotThreadsOnLargeQueueConfig;
this.currentTimeMillisSupplier = currentTimeMillisSupplier;
this.hotThreadsLogger = new FrequencyCappedAction(currentTimeMillisSupplier, TimeValue.ZERO);
Contributor

One thing to note is that the FrequencyCappedAction is not thread-safe. I imagine it's not the end of the world if we log twice, but worth noting.

Member Author

@ywangd ywangd Jan 8, 2026

Discussed offline and pushed 6202a70 to make the logging thread-safe.
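
For reference, a minimal sketch of the compare-and-set pattern the PR ends up using to keep concurrent submitters from logging more than once per interval (a simplified stand-in, not the actual class in the commit):

    import java.util.concurrent.atomic.AtomicLong;

    // At most one of any number of concurrent callers per interval wins the
    // compareAndSet and gets to perform the logging.
    class LoggingFrequencyCap {
        private final long intervalMillis;
        private final AtomicLong lastRunMillis;

        LoggingFrequencyCap(long intervalMillis, long initialLastRunMillis) {
            this.intervalMillis = intervalMillis;
            this.lastRunMillis = new AtomicLong(initialLastRunMillis);
        }

        boolean tryAcquire(long nowMillis) {
            final long last = lastRunMillis.get();
            return nowMillis - last >= intervalMillis && lastRunMillis.compareAndSet(last, nowMillis);
        }
    }

The hunk quoted later in this review (lastLoggingTimeForHotThreads.compareAndSet(lastLoggingTime, now)) applies the same idea directly inside the executor.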

Contributor

@DaveCTurner DaveCTurner left a comment

I have some concerns about exposing this in settings that we must therefore continue to support for a long time. I don't think this is something we want to keep long-term.


// There may be racing on updating this field. It's OK since hot threads logging is very coarse grained time wise
// and can tolerate some inaccuracies.
private volatile long startTimeOfLargeQueue = NOT_TRACKED_TIME;
Contributor

Can we mention millis in the name here? Ideally relativeMillis since there's no need for an absolute time right?

Member Author

@ywangd ywangd Jan 9, 2026

Name updated in f1c9fa4
Please see the other comment for relativeMillis

// and can tolerate some inaccuracies.
private volatile long startTimeOfLargeQueue = NOT_TRACKED_TIME;

private final AtomicLong lastLoggingTimeForHotThreads;
Contributor

Likewise here

Member Author

See above

handler,
contextHolder,
hotThreadsOnLargeQueueConfig,
System::currentTimeMillis
Contributor

Could we use the cached time from the threadpool?

Member Author

That was indeed the first thing I had in mind. Though it is technically possible, it implies quite a few cascading changes to the builders, subclasses and utility methods. In addition, executors are built in ThreadPool's constructor, so we would need this-escape to pass the relativeTimeInMillis method. Overall I don't think it is worth the effort, especially since we might not want to support this long-term.

final var lastLoggingTime = lastLoggingTimeForHotThreads.get();
if (now - lastLoggingTime >= hotThreadsOnLargeQueueConfig.intervalInMillis()
&& lastLoggingTimeForHotThreads.compareAndSet(lastLoggingTime, now)) {
HotThreads.logLocalHotThreads(
Contributor

This emits its logs after some delay - might be worth logging a message right now saying we're starting to capture hot threads at this point too.

Member Author

Added in acd1665


public static final HotThreadsOnLargeQueueConfig DISABLED = new HotThreadsOnLargeQueueConfig(0, -1, -1);

public boolean isEnabled() {
Contributor

Could we use == DISABLED rather than needing to call this method?

Member Author

To do that, we would need to ensure the DISABLED instance is returned whenever the class is instantiated with sizeThreshold == 0. It is technically possible, but it would require making the class's constructor private, which means it could not be a record and a factory method would have to be added. I am not sure that is a better trade-off compared to having this method. Please let me know if you think otherwise. Thanks!
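
For context, a rough sketch of what the == DISABLED alternative would look like; the class name and the non-threshold parameter names are placeholders, not the actual ones in the PR:

    // A plain class with a private constructor plus a factory method could canonicalise
    // the disabled case so callers compare with == DISABLED, but it could no longer be a record.
    final class HotThreadsOnLargeQueueConfigSketch {
        static final HotThreadsOnLargeQueueConfigSketch DISABLED = new HotThreadsOnLargeQueueConfigSketch(0, -1, -1);

        private final int sizeThreshold;
        private final long sustainedMillis;    // placeholder name
        private final long intervalInMillis;

        private HotThreadsOnLargeQueueConfigSketch(int sizeThreshold, long sustainedMillis, long intervalInMillis) {
            this.sizeThreshold = sizeThreshold;
            this.sustainedMillis = sustainedMillis;
            this.intervalInMillis = intervalInMillis;
        }

        static HotThreadsOnLargeQueueConfigSketch of(int sizeThreshold, long sustainedMillis, long intervalInMillis) {
            // every "disabled" request returns the shared instance
            return sizeThreshold == 0 ? DISABLED : new HotThreadsOnLargeQueueConfigSketch(sizeThreshold, sustainedMillis, intervalInMillis);
        }
    }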

Contributor

Ok we can go with this at least for now.

@elasticsearchmachine elasticsearchmachine added the serverless-linked label Jan 9, 2026
@ywangd
Member Author

ywangd commented Jan 9, 2026

@DaveCTurner

I have some concerns about exposing this in settings that we must therefore continue to support for a long time. I don't think this is something we want to keep long-term.

This is a good point. I moved the settings registration and hence the test to the serverless side. Please see 61687b1 and the linked PR for details. Thanks a lot!

@ywangd ywangd requested a review from DaveCTurner January 9, 2026 01:12
Contributor

@nicktindall nicktindall left a comment

LGTM, just a question about whether EsThreadPoolExecutorTestHelper could live in src/test instead of src/main, but perhaps that's not possible.

this.hotThreadsOnLargeQueueConfig = hotThreadsOnLargeQueueConfig;
this.currentTimeMillisSupplier = currentTimeMillisSupplier;
this.lastLoggingTimeMillisForHotThreads = hotThreadsOnLargeQueueConfig.isEnabled()
? new AtomicLong(currentTimeMillisSupplier.getAsLong() - hotThreadsOnLargeQueueConfig.intervalInMillis())
Contributor

Nit: this could also just be zero, couldn't it?

Member Author

@ywangd ywangd Jan 9, 2026

Not really. Zero effectively means the logging has an initial delay of 60 min (the default interval) on node start. So we can miss the logging if a large queue size occurs within that time frame.

Contributor

@nicktindall nicktindall Jan 9, 2026

Would it? Because now would be millis since epoch, now - 0 would be much larger than intervalInMillis?

if (now - lastLoggingTime >= hotThreadsOnLargeQueueConfig.intervalInMillis()

Member Author

Sorry, I misunderstood your suggestion. Yes, it can technically be 0 since we are using an absolute timer. I didn't do it because that makes test manipulation harder, i.e. the initial time would be fixed and could not be controlled by the unit test, hence the choice. I prefer to keep it as is for the time being.
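
To spell the trade-off out with made-up numbers: with a controllable test clock starting at t0 = 100000 ms and the default interval of 3600000 ms, initialising the field to 0 makes the first check 100000 - 0 >= 3600000, which fails until the fake clock has advanced by nearly the full interval, whereas initialising it to t0 - 3600000 makes the first check pass immediately. With the real epoch clock, now is large enough that either initial value passes straight away, which is the point made above.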

@ywangd
Member Author

ywangd commented Jan 9, 2026

whether EsThreadPoolExecutorTestHelper could live in src/test instead of src/main

As discussed elsewhere, this helper class is located in test/framework/src/main so that it is not part of the production code.

@ywangd
Member Author

ywangd commented Jan 9, 2026

@DaveCTurner I have addressed your review comments and intend to merge this PR in the next 12 hours so that it has a good chance of making next week's release. If it gets merged before you can re-review, I am happy to address any further comments in follow-ups. Thanks! 🙏

Contributor

@DaveCTurner DaveCTurner left a comment

LGTM



@ywangd ywangd added the auto-merge-without-approval label Jan 9, 2026
@elasticsearchmachine elasticsearchmachine merged commit d8c3c6f into elastic:main Jan 9, 2026
35 checks passed
@ywangd ywangd deleted the ES-13904-hot-threads-management-thread-pool branch January 9, 2026 12:40
szybia added a commit to szybia/elasticsearch that referenced this pull request Jan 9, 2026
* upstream/main: (76 commits)
  [Inference API] Get _services skips EIS authorization call if CCM is not configured (elastic#139964)
  Improve TSDB codec benchmarks with full encoder and compression metrics (elastic#140299)
  ESQL: Consolidate test `BlockLoaderContext`s (elastic#140403)
  ESQL: Improve Lookup Join performance with CachedDirectoryReader (elastic#139314)
  ES|QL: Add more examples for the match operator (elastic#139815)
  ESQL: Add timezone to add and sub operators, and ConfigurationAware planning support (elastic#140101)
  ESQL: Updated ToIp tests and generated documentation for map parameters (elastic#139994)
  Disable _delete_by_query and _update_by_query for CCS/stateful (elastic#140301)
  Remove unused method ElasticInferenceService.translateToChunkedResults (elastic#140442)
  logging hot threads on large queue of the management threadpool (elastic#140251)
  Search functions docs cleanup (elastic#140435)
  Unmute 350_point_in_time/point-in-time with index filter (elastic#140443)
  Remove unused methods (elastic#140222)
  Add CPS and `project_routing` support for `_mvt` (elastic#140053)
  Streamline `ShardDeleteResults` collection (elastic#140363)
  Fix Docker build to use --load for single-platform images (elastic#140402)
  Parametrize + test VectorScorerOSQBenchmark (elastic#140354)
  `RecyclerBytesStreamOutput` using absolute offsets (elastic#140303)
  Define bulk float native methods for vector scoring (elastic#139885)
  Make `TimeSeriesAggregate` `TimestampAware` (elastic#140270)
  ...
jimczi pushed a commit to jimczi/elasticsearch that referenced this pull request Jan 12, 2026
logging hot threads on large queue of the management threadpool (elastic#140251)

@repantis repantis added the :Distributed/Distributed label and removed the :Distributed Coordination/Distributed label Jan 28, 2026
breskeby pushed a commit to breskeby/elasticsearch that referenced this pull request Feb 11, 2026
…ement queue size (elastic#5195)

This PR registers the settings added in elastic#140251 so that they are only configurable in serverless. It also adds tests for the logging behaviour.

Labels

auto-merge-without-approval, :Core/Infra/Core, :Distributed/Distributed, >non-issue, serverless-linked, Team:Core/Infra, Team:Distributed Coordination (obsolete), v9.4.0

7 participants