logging hot threads on large queue of the management threadpool #140251

Merged
elasticsearchmachine merged 19 commits into elastic:main from ywangd:ES-13904-hot-threads-management-thread-pool on Jan 9, 2026

Conversation

@ywangd
Member

@ywangd ywangd commented Jan 7, 2026

We have a requirement to tell why the management threadpool gets piled up for an extended period of time. The idea is to log hot threads when that happens. Today this is done manually, but the queue has often drained by the time a human operator notices the alert and takes action. This PR explores the option of automating the capture within Elasticsearch.

The changes add new logic to EsThreadPoolExecutor#execute so that it checks the queue size and tracks how long it stays above a threshold. It eventually logs hot threads if the queue does not drain within the configured time (see the sketch after the list below). The approach has a few limitations:

  1. ThreadPool and related classes cannot register dynamic cluster settings, so the new settings are static, i.e. they require a rolling restart to take effect.
  2. It requires a new task to be submitted to trigger the potential hot-threads logging. So it is possible that the large queue has persisted for longer than the configured time, i.e. the logging may not always be on time. But this should not be much of an issue, since it is unlikely for the queue to stop moving for an extended time at exactly the threshold.
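
For illustration only, here is a rough, self-contained sketch of the tracking idea. The class, field and setting names below are made up for the example and are not the actual names added by this PR; the real change is wired into EsThreadPoolExecutor rather than a standalone class.

    // Rough sketch only: names and structure are illustrative, not the PR's actual code.
    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.ThreadPoolExecutor;
    import java.util.concurrent.TimeUnit;

    class QueueTrackingExecutor extends ThreadPoolExecutor {
        private static final long NOT_TRACKED = -1L;
        private final int largeQueueThreshold;   // hypothetical setting: queue size considered "large"
        private final long sustainedMillis;      // hypothetical setting: how long the queue must stay large
        // racy updates are tolerable here; the check is deliberately coarse-grained time-wise
        private volatile long largeQueueStartMillis = NOT_TRACKED;

        QueueTrackingExecutor(int threads, int largeQueueThreshold, long sustainedMillis) {
            super(threads, threads, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
            this.largeQueueThreshold = largeQueueThreshold;
            this.sustainedMillis = sustainedMillis;
        }

        @Override
        public void execute(Runnable command) {
            checkQueueSize(System.currentTimeMillis());
            super.execute(command);
        }

        private void checkQueueSize(long now) {
            if (getQueue().size() < largeQueueThreshold) {
                largeQueueStartMillis = NOT_TRACKED;     // queue drained, stop tracking
            } else if (largeQueueStartMillis == NOT_TRACKED) {
                largeQueueStartMillis = now;             // queue just became large, start the clock
            } else if (now - largeQueueStartMillis >= sustainedMillis) {
                // the queue has stayed large for the configured time; the real change calls
                // HotThreads.logLocalHotThreads here, rate-limited as discussed in the review below
                System.out.println("queue large for " + (now - largeQueueStartMillis) + "ms, capturing hot threads");
            }
        }
    }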

A big question I have is whether this could be automated outside Elasticsearch. Today the alert requires human interaction, which can be slow (and thus too late to capture the right moment). But is it possible to add an automated alert response that calls the hot-threads API against the relevant cluster? That would have a much better chance of getting the desired output and, if possible, would save the code change here and its ongoing maintenance.

The internal approach in this PR does have its own advantages: (1) it is guaranteed to capture the necessary output, and (2) the hot-threads output can be more informative since it includes the stack trace of the thread that is submitting the task. This gives a better hint of what is hitting the management thread pool, because the tasks already running on the pool may come from different sources and only the tasks being queued are relevant. Overall, hot threads is a close approximation for figuring out "why the queue got piled up".

Resolves: ES-13904

elasticsearchmachine and others added 9 commits January 7, 2026 09:27
@ywangd
Member Author

ywangd commented Jan 8, 2026

Per discussion on ES-13904, my current conclusion is that we want to continue with this PR as well as adding support for possible external automation. The former satisfies the immediate need for the investigation. The latter requires additional work and could be useful for other future investigations.

@ywangd ywangd marked this pull request as ready for review January 8, 2026 03:44
@ywangd ywangd requested a review from a team as a code owner January 8, 2026 03:44
@ywangd ywangd requested review from nicktindall and removed request for a team January 8, 2026 03:45
@elasticsearchmachine elasticsearchmachine added the Team:Core/Infra and Team:Distributed Coordination (obsolete) labels Jan 8, 2026
@elasticsearchmachine
Collaborator

Pinging @elastic/es-core-infra (Team:Core/Infra)

@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

Contributor

@nicktindall nicktindall left a comment


If we decide to do it this way, it seems like a good implementation

this.contextHolder = contextHolder;
this.hotThreadsOnLargeQueueConfig = hotThreadsOnLargeQueueConfig;
this.currentTimeMillisSupplier = currentTimeMillisSupplier;
this.hotThreadsLogger = new FrequencyCappedAction(currentTimeMillisSupplier, TimeValue.ZERO);
Contributor

One thing to note is that the FrequencyCappedAction is not thread-safe. I imagine it's not the end of the world if we log twice, but worth noting.

Member Author

@ywangd ywangd Jan 8, 2026

Discussed offline and pushed 6202a70 to make the logging thread-safe.
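
For reference, a minimal sketch of the compare-and-set pattern the PR ends up using to keep concurrent submitters from logging more than once per interval (a simplified stand-in, not the actual class in the commit):

    import java.util.concurrent.atomic.AtomicLong;

    // At most one of any number of concurrent callers per interval wins the
    // compareAndSet and gets to perform the logging.
    class LoggingFrequencyCap {
        private final long intervalMillis;
        private final AtomicLong lastRunMillis;

        LoggingFrequencyCap(long intervalMillis, long initialLastRunMillis) {
            this.intervalMillis = intervalMillis;
            this.lastRunMillis = new AtomicLong(initialLastRunMillis);
        }

        boolean tryAcquire(long nowMillis) {
            final long last = lastRunMillis.get();
            return nowMillis - last >= intervalMillis && lastRunMillis.compareAndSet(last, nowMillis);
        }
    }

The hunk quoted later in this review (lastLoggingTimeForHotThreads.compareAndSet(lastLoggingTime, now)) applies the same idea directly inside the executor.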

Contributor

@DaveCTurner DaveCTurner left a comment

I have some concerns about exposing this in settings that we must therefore continue to support for a long time. I don't think this is something we want to keep long-term.


// There may be racing on updating this field. It's OK since hot threads logging is very coarse grained time wise
// and can tolerate some inaccuracies.
private volatile long startTimeOfLargeQueue = NOT_TRACKED_TIME;
Contributor

Can we mention millis in the name here? Ideally relativeMillis since there's no need for an absolute time right?

Member Author

@ywangd ywangd Jan 9, 2026

Name updated in f1c9fa4
Please see the other comment for relativeMillis

// and can tolerate some inaccuracies.
private volatile long startTimeOfLargeQueue = NOT_TRACKED_TIME;

private final AtomicLong lastLoggingTimeForHotThreads;
Contributor

Likewise here

Member Author

See above

handler,
contextHolder,
hotThreadsOnLargeQueueConfig,
System::currentTimeMillis
Contributor

Could we use the cached time from the threadpool?

Member Author

That was indeed the first thing I had in mind. Though it is technically possible, it implies quite a few cascading changes to the builders, subclasses and utility methods. In addition, executors are built in ThreadPool's constructor, so we would need this-escape to pass the relativeTimeInMillis method. Overall I don't think it is worth the effort, especially since we might not want to support this long-term.

final var lastLoggingTime = lastLoggingTimeForHotThreads.get();
if (now - lastLoggingTime >= hotThreadsOnLargeQueueConfig.intervalInMillis()
&& lastLoggingTimeForHotThreads.compareAndSet(lastLoggingTime, now)) {
HotThreads.logLocalHotThreads(
Contributor

This emits its logs after some delay - might be worth logging a message right now saying we're starting to capture hot threads at this point too.

Member Author

Added in acd1665


public static final HotThreadsOnLargeQueueConfig DISABLED = new HotThreadsOnLargeQueueConfig(0, -1, -1);

public boolean isEnabled() {
Contributor

Could we use == DISABLED rather than needing to call this method?

Member Author

To do that, we would need to ensure the DISABLED instance is returned whenever the class is instantiated with sizeThreshold == 0. It is technically possible, but it would require making the class's constructor private, which means it could not be a record and a factory method would have to be added. I am not sure that is a better trade-off compared to having this method. Please let me know if you think otherwise. Thanks!
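
For context, a rough sketch of what the == DISABLED alternative would look like; the class name and the non-threshold parameter names are placeholders, not the actual ones in the PR:

    // A plain class with a private constructor plus a factory method could canonicalise
    // the disabled case so callers compare with == DISABLED, but it could no longer be a record.
    final class HotThreadsOnLargeQueueConfigSketch {
        static final HotThreadsOnLargeQueueConfigSketch DISABLED = new HotThreadsOnLargeQueueConfigSketch(0, -1, -1);

        private final int sizeThreshold;
        private final long sustainedMillis;    // placeholder name
        private final long intervalInMillis;

        private HotThreadsOnLargeQueueConfigSketch(int sizeThreshold, long sustainedMillis, long intervalInMillis) {
            this.sizeThreshold = sizeThreshold;
            this.sustainedMillis = sustainedMillis;
            this.intervalInMillis = intervalInMillis;
        }

        static HotThreadsOnLargeQueueConfigSketch of(int sizeThreshold, long sustainedMillis, long intervalInMillis) {
            // every "disabled" request returns the shared instance
            return sizeThreshold == 0 ? DISABLED : new HotThreadsOnLargeQueueConfigSketch(sizeThreshold, sustainedMillis, intervalInMillis);
        }
    }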

Contributor

Ok we can go with this at least for now.

@elasticsearchmachine elasticsearchmachine added the serverless-linked label Jan 9, 2026
@ywangd
Member Author

ywangd commented Jan 9, 2026

@DaveCTurner

I have some concerns about exposing this in settings that we must therefore continue to support for a long time. I don't think this is something we want to keep long-term.

This is a good point. I moved the settings registration and hence the test to the serverless side. Please see 61687b1 and the linked PR for details. Thanks a lot!

@ywangd ywangd requested a review from DaveCTurner January 9, 2026 01:12
Contributor

@nicktindall nicktindall left a comment

LGTM, just a question about whether EsThreadPoolExecutorTestHelper could live in src/test instead of src/main, but perhaps that's not possible.

this.hotThreadsOnLargeQueueConfig = hotThreadsOnLargeQueueConfig;
this.currentTimeMillisSupplier = currentTimeMillisSupplier;
this.lastLoggingTimeMillisForHotThreads = hotThreadsOnLargeQueueConfig.isEnabled()
? new AtomicLong(currentTimeMillisSupplier.getAsLong() - hotThreadsOnLargeQueueConfig.intervalInMillis())
Contributor

Nit: this could also just be zero, couldn't it?

Member Author

@ywangd ywangd Jan 9, 2026

Not really. Zero effectively means the logging has an initial delay of 60 min (the default interval) on node start. So we can miss the logging if a large queue size occurs within that time frame.

Contributor

@nicktindall nicktindall Jan 9, 2026

Would it? Because now would be millis since epoch, now - 0 would be much larger than intervalInMillis?

if (now - lastLoggingTime >= hotThreadsOnLargeQueueConfig.intervalInMillis()

Member Author

Sorry, I misunderstood your suggestion. Yes, it can technically be 0 since we are using an absolute timer. I didn't do it because that makes test manipulation harder, i.e. the initial time would be fixed and could not be controlled by the unit test, hence the choice. I prefer to keep it as is for the time being.
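
To spell the trade-off out with made-up numbers: with a controllable test clock starting at t0 = 100000 ms and the default interval of 3600000 ms, initialising the field to 0 makes the first check 100000 - 0 >= 3600000, which fails until the fake clock has advanced by nearly the full interval, whereas initialising it to t0 - 3600000 makes the first check pass immediately. With the real epoch clock, now is large enough that either initial value passes straight away, which is the point made above.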

@ywangd
Member Author

ywangd commented Jan 9, 2026

whether EsThreadPoolExecutorTestHelper could live in src/test instead of src/main

As discussed elsewhere, this helper class is located in test/framework/src/main so that it is not part of the production code.

@ywangd
Member Author

ywangd commented Jan 9, 2026

@DaveCTurner I have addressed your review comments and intend to merge this PR in the next 12 hours so that it has a good chance of making next week's release. If it gets merged before you can re-review, I am happy to address any further comments in follow-ups. Thanks! 🙏

Contributor

@DaveCTurner DaveCTurner left a comment

LGTM



@ywangd ywangd added the auto-merge-without-approval label Jan 9, 2026
@elasticsearchmachine elasticsearchmachine merged commit d8c3c6f into elastic:main Jan 9, 2026
35 checks passed
@ywangd ywangd deleted the ES-13904-hot-threads-management-thread-pool branch January 9, 2026 12:40
szybia added a commit to szybia/elasticsearch that referenced this pull request Jan 9, 2026
* upstream/main: (76 commits)
  [Inference API] Get _services skips EIS authorization call if CCM is not configured (elastic#139964)
  Improve TSDB codec benchmarks with full encoder and compression metrics (elastic#140299)
  ESQL: Consolidate test `BlockLoaderContext`s (elastic#140403)
  ESQL: Improve Lookup Join performance with CachedDirectoryReader (elastic#139314)
  ES|QL: Add more examples for the match operator (elastic#139815)
  ESQL: Add timezone to add and sub operators, and ConfigurationAware planning support (elastic#140101)
  ESQL: Updated ToIp tests and generated documentation for map parameters (elastic#139994)
  Disable _delete_by_query and _update_by_query for CCS/stateful (elastic#140301)
  Remove unused method ElasticInferenceService.translateToChunkedResults (elastic#140442)
  logging hot threads on large queue of the management threadpool (elastic#140251)
  Search functions docs cleanup (elastic#140435)
  Unmute 350_point_in_time/point-in-time with index filter (elastic#140443)
  Remove unused methods (elastic#140222)
  Add CPS and `project_routing` support for `_mvt` (elastic#140053)
  Streamline `ShardDeleteResults` collection (elastic#140363)
  Fix Docker build to use --load for single-platform images (elastic#140402)
  Parametrize + test VectorScorerOSQBenchmark (elastic#140354)
  `RecyclerBytesStreamOutput` using absolute offsets (elastic#140303)
  Define bulk float native methods for vector scoring (elastic#139885)
  Make `TimeSeriesAggregate` `TimestampAware` (elastic#140270)
  ...
jimczi pushed a commit to jimczi/elasticsearch that referenced this pull request Jan 12, 2026
logging hot threads on large queue of the management threadpool (elastic#140251)

@repantis repantis added the :Distributed/Distributed label and removed the :Distributed Coordination/Distributed label Jan 28, 2026
breskeby pushed a commit to breskeby/elasticsearch that referenced this pull request Feb 11, 2026
…ement queue size (elastic#5195)

This PR registers the settings added in elastic#140251 so that they are only configurable in serverless. It also adds tests for the logging behaviour.

Labels

auto-merge-without-approval, :Core/Infra/Core, :Distributed/Distributed, >non-issue, serverless-linked, Team:Core/Infra, Team:Distributed Coordination (obsolete), v9.4.0

7 participants