Add search task watchdog to log hot threads on slow search by andreidan · Pull Request #142746 · elastic/elasticsearch

andreidan · 2026-02-20T10:33:57Z

Introduces an opt-in watchdog that logs hot threads when search tasks exceed configurable time thresholds. Each node monitors its own tasks via TaskManager, avoiding cross-node coordination complexity. Data nodes log when shard-level tasks (query/fetch) exceed threshold. Coordinators log only when the reduce/merge phase is slow, detected by checking that all child tasks have completed before logging.

This introduces the following settings:

search.task_watchdog.enabled (default: false)
search.task_watchdog.coordinator_threshold (default: 3s)
search.task_watchdog.data_node_threshold (default: 3s)
search.task_watchdog.interval (default: 1s)
search.task_watchdog.cooldown_period (default: 30s)
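
For readers skimming the description, here is a minimal sketch of the decision it describes: data node tasks are compared against their threshold, coordinator tasks are considered only once all child tasks have completed, and a cooldown limits how often hot threads are logged. The class `WatchdogSketch`, its method names, the strict `>` comparison, and the single node-wide cooldown are assumptions made for illustration; this is not the code in `SearchTaskWatchdog.java`.

```java
/** Hypothetical, simplified model of the watchdog's per-task decision (not the PR's code). */
final class WatchdogSketch {
    enum Role { COORDINATOR, DATA_NODE }

    private final long coordinatorThresholdNanos;
    private final long dataNodeThresholdNanos;
    private final long cooldownNanos;
    private boolean hasLogged = false;
    private long lastLogNanos;

    WatchdogSketch(long coordinatorThresholdNanos, long dataNodeThresholdNanos, long cooldownNanos) {
        this.coordinatorThresholdNanos = coordinatorThresholdNanos;
        this.dataNodeThresholdNanos = dataNodeThresholdNanos;
        this.cooldownNanos = cooldownNanos;
    }

    /** True if the task ran longer than its threshold; coordinators also need zero outstanding children. */
    boolean exceedsThreshold(Role role, long runningNanos, int outstandingChildTasks) {
        if (role == Role.DATA_NODE) {
            return runningNanos > dataNodeThresholdNanos;
        }
        // Coordinator: only a slow reduce/merge phase is interesting, so wait until
        // every child task has completed before treating the task as slow.
        return outstandingChildTasks == 0 && runningNanos > coordinatorThresholdNanos;
    }

    /** Returns true, and records the time, if hot threads should be logged now. */
    boolean shouldLog(Role role, long runningNanos, int outstandingChildTasks, long nowNanos) {
        if (!exceedsThreshold(role, runningNanos, outstandingChildTasks)) {
            return false;
        }
        if (hasLogged && nowNanos - lastLogNanos < cooldownNanos) {
            return false; // still inside the cooldown period
        }
        hasLogged = true;
        lastLogNanos = nowNanos;
        return true;
    }
}
```

In this sketch a periodic check (the `interval` setting) would call `shouldLog` for each running search task and dump hot threads when it returns true.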


elasticsearchmachine · 2026-02-20T10:34:22Z

Pinging @elastic/es-search-foundations (Team:Search Foundations)

elasticsearchmachine · 2026-02-20T10:34:44Z

Hi @andreidan, I've created a changelog YAML for you.

github-actions · 2026-02-20T10:37:15Z

🔍 Preview links for changed docs

docs/reference/elasticsearch/configuration-reference/search-settings.md

github-actions · 2026-02-20T10:37:16Z

ℹ️ Important: Docs version tagging

👋 Thanks for updating the docs! Just a friendly reminder that our docs are now cumulative. This means all 9.x versions are documented on the same page and published off of the main branch, instead of creating separate pages for each minor version.

We use applies_to tags to mark version-specific features and changes.


When to use applies_to tags:

✅ At the page level to indicate which products/deployments the content applies to (mandatory)
✅ When features change state (e.g. preview, ga) in a specific version
✅ When availability differs across deployments and environments

What NOT to do:

❌ Don't remove or replace information that applies to an older version
❌ Don't add new information that applies to a specific version without an applies_to tag
❌ Don't forget that applies_to tags can be used at the page, section, and inline level

🤔 Need help?

Check out the cumulative docs guidelines
Reach out in the #docs Slack channel

leemthompo

Docs LGTM 👌. I made minor suggestions in the opening paragraphs: add a couple of links, and break a long sentence at the end into two, with a minor rewording for clarity.

docs/reference/elasticsearch/configuration-reference/search-settings.md

andreidan · 2026-02-20T14:44:57Z

Failure was #141734

spinscale

left a few minor comments, looking forward to the functionality!

docs/reference/elasticsearch/configuration-reference/search-settings.md

spinscale · 2026-02-23T08:32:08Z

server/src/main/java/org/elasticsearch/action/search/SearchTaskWatchdog.java

+        this.taskManager = taskManager;
+        this.threadPool = threadPool;
+
+        this.enabled = ENABLED.get(settings);


aren't all of these called with the settings update consumer anyway, so no need to call twice?

I'm not sure I understand. What are we calling twice?
We initialize enabled here and then subscribe to changes a bit later in the constructor.

Perhaps using initializeAndWatch is what you meant here? d1a0d29
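
To make the thread above concrete, here is a small self-contained sketch of the two patterns being discussed: reading the initial value and then subscribing separately, versus a single call that applies the current value and registers for updates. `SettingRegistrySketch` and `WatchdogFlagHolder` are hypothetical stand-ins written for this illustration; they are not the Elasticsearch settings classes, and only mimic the idea behind an `initializeAndWatch`-style call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical stand-in for a dynamic-settings registry (NOT the Elasticsearch API). */
final class SettingRegistrySketch<T> {
    private T current;
    private final List<Consumer<T>> consumers = new ArrayList<>();

    SettingRegistrySketch(T initial) {
        this.current = initial;
    }

    T get() {
        return current;
    }

    /** Registers a consumer for future updates only; the caller must read the initial value itself. */
    void addUpdateConsumer(Consumer<T> consumer) {
        consumers.add(consumer);
    }

    /** Applies the current value right away, then registers the consumer for future updates. */
    void initializeAndWatch(Consumer<T> consumer) {
        consumer.accept(current);
        consumers.add(consumer);
    }

    /** Stores a new value and notifies every registered consumer. */
    void update(T newValue) {
        current = newValue;
        for (Consumer<T> consumer : consumers) {
            consumer.accept(newValue);
        }
    }
}

/** Hypothetical holder that keeps an "enabled" flag in sync with the registry. */
final class WatchdogFlagHolder {
    volatile boolean enabled;

    WatchdogFlagHolder(SettingRegistrySketch<Boolean> registry) {
        // One call replaces "enabled = registry.get()" followed by "registry.addUpdateConsumer(...)".
        registry.initializeAndWatch(value -> this.enabled = value);
    }
}
```

The single-call form removes the gap between reading the initial value and subscribing, which appears to be the duplication the reviewer was pointing at.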

spinscale · 2026-02-23T08:33:37Z

server/src/main/java/org/elasticsearch/action/search/SearchTaskWatchdog.java

+
+    private void setCoordinatorThreshold(long newCoordinatorThresholdValue) {
+        this.coordinatorThresholdNanos = newCoordinatorThresholdValue;
+        this.minThresholdNanos = computeMinThreshold(newCoordinatorThresholdValue, dataNodeThresholdNanos);


No need to pass variables: this.coordinatorThresholdNanos is already set here and this.dataNodeThresholdNanos is a field too, so can't you just use them inside computeMinThreshold()?

I think it's more readable with parameters (i.e. it conveys what it does without having to step inside the method)

haha, for me it was the opposite. As we're essentially testing this.coordinatorThresholdNanos and this.dataNodeThresholdNanos, but passing different variable names, I considered it harder to read. The name conveys to me what it does.

I don't have a strong preference though.
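
The two styles debated in this thread can be put side by side in a short sketch. It assumes that `computeMinThreshold` simply returns the smaller of the two thresholds, which the name suggests but the quoted diff does not show; `ThresholdsSketch` and `computeMinThresholdFromFields` are hypothetical names used only here.

```java
/** Hypothetical sketch contrasting the two styles from the review thread. */
final class ThresholdsSketch {
    private long coordinatorThresholdNanos;
    private long dataNodeThresholdNanos;
    private long minThresholdNanos;

    ThresholdsSketch(long coordinatorThresholdNanos, long dataNodeThresholdNanos) {
        this.coordinatorThresholdNanos = coordinatorThresholdNanos;
        this.dataNodeThresholdNanos = dataNodeThresholdNanos;
        this.minThresholdNanos = computeMinThreshold(coordinatorThresholdNanos, dataNodeThresholdNanos);
    }

    // Style A (as in the quoted diff): the inputs are visible at the call site.
    // Assumption: the result is the smaller of the two thresholds.
    static long computeMinThreshold(long coordinatorNanos, long dataNodeNanos) {
        return Math.min(coordinatorNanos, dataNodeNanos);
    }

    // Style B (the reviewer's suggestion): read the fields directly.
    // This is only correct if the fields are assigned before it is called.
    private long computeMinThresholdFromFields() {
        return Math.min(coordinatorThresholdNanos, dataNodeThresholdNanos);
    }

    void setCoordinatorThreshold(long newCoordinatorThresholdValue) {
        this.coordinatorThresholdNanos = newCoordinatorThresholdValue;
        this.minThresholdNanos = computeMinThreshold(newCoordinatorThresholdValue, dataNodeThresholdNanos);
    }

    void setDataNodeThreshold(long newDataNodeThresholdValue) {
        this.dataNodeThresholdNanos = newDataNodeThresholdValue;
        this.minThresholdNanos = computeMinThresholdFromFields();
    }

    long minThresholdNanos() {
        return minThresholdNanos;
    }
}
```

Both setters produce the same result; the difference is whether the dependency on the two fields is visible at the call site (style A) or hidden behind the method name (style B).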




andreidan added the >enhancement label Feb 20, 2026

andreidan requested review from a team as code owners February 20, 2026 10:33

andreidan added the :Search Foundations/Search label Feb 20, 2026

elasticsearchmachine added the v9.4.0 and Team:Search Foundations labels Feb 20, 2026

andreidan requested a review from spinscale February 20, 2026 10:34

Update docs/changelog/142746.yaml

e89be21

andreidan added 5 commits February 20, 2026 10:45

comment

fdb151b

docs applies_to marker

eef35d6

logging

61e67b8

stop, nano time

c4f3093

assertBusy as there's an element of time lapsing

874b19e

leemthompo reviewed Feb 20, 2026

docs/reference/elasticsearch/configuration-reference/search-settings.md

andreidan added 3 commits February 20, 2026 13:51

Docs wording

6f9f7a7

docs suggestion

0eb8018

Merge branch 'main' into hotthreads-slow-search

b0a4c2b

Merge branch 'main' into hotthreads-slow-search

ffde4fe

spinscale reviewed Feb 23, 2026


andreidan added 2 commits February 23, 2026 11:19

Extract isInCooldownPeriod

57f95d4

threadpool.nanos

94079f9

andreidan requested a review from spinscale February 23, 2026 11:22

Use initializeAndWatch

d1a0d29

spinscale approved these changes Feb 23, 2026


andreidan added 2 commits February 23, 2026 14:34

setting enabled needs the interval

9d9c642

Merge branch 'main' into hotthreads-slow-search

a625979

elasticsearchmachine added the serverless-linked label Feb 23, 2026

andreidan removed the serverless-linked label Feb 23, 2026

Merge branch 'main' into hotthreads-slow-search

017fff8

andreidan merged commit c377cee into elastic:main Feb 23, 2026
35 checks passed


4 participants
