Add search task watchdog to log hot threads on slow search#142746

Merged
andreidan merged 17 commits into elastic:main from andreidan:hotthreads-slow-search
Feb 23, 2026

@andreidan
Contributor

Introduces an opt-in watchdog that logs hot threads when search tasks exceed configurable time thresholds. Each node monitors its own tasks via TaskManager, which avoids the complexity of cross-node coordination. Data nodes log when a shard-level task (query/fetch) exceeds the data node threshold. Coordinators log only when the reduce/merge phase is slow, which is detected by checking that all child tasks have completed before logging.

This introduces the following settings:

  • search.task_watchdog.enabled (default: false)
  • search.task_watchdog.coordinator_threshold (default: 3s)
  • search.task_watchdog.data_node_threshold (default: 3s)
  • search.task_watchdog.interval (default: 1s)
  • search.task_watchdog.cooldown_period (default: 30s)
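
The rules in the description can be sketched as plain predicates. This is a minimal illustration, not the PR's implementation: `SlowSearchCheckSketch`, `TaskView` and every method name below are hypothetical, and the cooldown rule (one dump per cooldown window) is an assumption about how `cooldown_period` is applied.

```java
import java.util.List;

/** Hypothetical, simplified model of the per-node check; no name here is taken from the PR's code. */
final class SlowSearchCheckSketch {

    /** Minimal stand-in for a search task as seen by the local node. */
    static final class TaskView {
        final boolean shardLevel;   // true for query/fetch tasks running on a data node
        final long runningNanos;    // how long the task has been running
        final int runningChildren;  // child tasks still in flight (relevant for coordinator tasks)

        TaskView(boolean shardLevel, long runningNanos, int runningChildren) {
            this.shardLevel = shardLevel;
            this.runningNanos = runningNanos;
            this.runningChildren = runningChildren;
        }
    }

    /** Data node rule: a shard-level task is slow once it runs longer than the data node threshold. */
    static boolean isSlowShardTask(TaskView task, long dataNodeThresholdNanos) {
        return task.shardLevel && task.runningNanos > dataNodeThresholdNanos;
    }

    /** Coordinator rule: over the threshold and no child still running, so the time is going to reduce/merge. */
    static boolean isSlowReduce(TaskView task, long coordinatorThresholdNanos) {
        return task.shardLevel == false
            && task.runningChildren == 0
            && task.runningNanos > coordinatorThresholdNanos;
    }

    /** Assumed cooldown rule: allow a new dump only once the cooldown has passed since the previous one. */
    static boolean cooldownElapsed(long nowNanos, long lastLoggedNanos, long cooldownNanos) {
        return nowNanos - lastLoggedNanos >= cooldownNanos;
    }

    /** One periodic pass over the node's own tasks: true if any task matches either rule. */
    static boolean shouldLogHotThreads(List<TaskView> tasks, long coordinatorThresholdNanos, long dataNodeThresholdNanos) {
        for (TaskView task : tasks) {
            if (isSlowShardTask(task, dataNodeThresholdNanos) || isSlowReduce(task, coordinatorThresholdNanos)) {
                return true;
            }
        }
        return false;
    }
}
```

In this sketch a coordinator task that is still waiting on shard responses never triggers a dump, however long it has been running, because the slow work is happening on another node that monitors itself.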

@andreidan andreidan requested review from a team as code owners February 20, 2026 10:33
@andreidan andreidan added the :Search Foundations/Search Catch all for Search Foundations label Feb 20, 2026
@elasticsearchmachine elasticsearchmachine added v9.4.0 Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch labels Feb 20, 2026
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search-foundations (Team:Search Foundations)

@andreidan andreidan requested a review from spinscale February 20, 2026 10:34
@elasticsearchmachine
Collaborator

Hi @andreidan, I've created a changelog YAML for you.

@github-actions
Contributor

github-actions bot commented Feb 20, 2026

🔍 Preview links for changed docs

@github-actions
Contributor

ℹ️ Important: Docs version tagging

👋 Thanks for updating the docs! Just a friendly reminder that our docs are now cumulative. This means all 9.x versions are documented on the same page and published off of the main branch, instead of creating separate pages for each minor version.

We use applies_to tags to mark version-specific features and changes.

When to use applies_to tags:

✅ At the page level to indicate which products/deployments the content applies to (mandatory)
✅ When features change state (e.g. preview, ga) in a specific version
✅ When availability differs across deployments and environments

What NOT to do:

❌ Don't remove or replace information that applies to an older version
❌ Don't add new information that applies to a specific version without an applies_to tag
❌ Don't forget that applies_to tags can be used at the page, section, and inline level

🤔 Need help?

Member

@leemthompo leemthompo left a comment

Docs LGTM 👌. Made minor suggestions in the opening paragraphs to add a couple of links and to break a long sentence at the end into two, with a minor rewording for clarity.

@andreidan
Contributor Author

Failure was #141734

Contributor

@spinscale spinscale left a comment

left a few minor comments, looking forward to the functionality!

this.taskManager = taskManager;
this.threadPool = threadPool;

this.enabled = ENABLED.get(settings);
Contributor

aren't all of these called with the settings update consumer anyway, so no need to call twice?

Contributor Author

@andreidan andreidan Feb 23, 2026

I'm not sure I understand. What are we calling twice?
We initialize enabled here and then subscribe to changes a bit later in the constructor.

Contributor Author

Perhaps using initializeAndWatch is what you meant here? d1a0d29

Contributor

yes!
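
The pattern agreed on in this thread, applying the current value and subscribing to later updates in a single call, can be sketched without any Elasticsearch dependency. `DynamicBooleanSetting` and `WatchdogSketch` below are hypothetical stand-ins written for illustration; they are not the Elasticsearch settings classes, and the actual change is in commit d1a0d29.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical stand-in for a dynamically updatable boolean setting. */
final class DynamicBooleanSetting {
    private boolean value;
    private final List<Consumer<Boolean>> consumers = new ArrayList<>();

    DynamicBooleanSetting(boolean initialValue) {
        this.value = initialValue;
    }

    /** Registers the consumer for future updates only; the caller must read the initial value separately. */
    void addUpdateConsumer(Consumer<Boolean> consumer) {
        consumers.add(consumer);
    }

    /** Hands the current value to the consumer right away, then registers it for future updates. */
    void initializeAndWatch(Consumer<Boolean> consumer) {
        consumer.accept(value);
        addUpdateConsumer(consumer);
    }

    /** Simulates a settings update reaching this node. */
    void update(boolean newValue) {
        value = newValue;
        consumers.forEach(consumer -> consumer.accept(newValue));
    }
}

/** Hypothetical watchdog that wires its field with one call instead of "read once, then subscribe". */
final class WatchdogSketch {
    private volatile boolean enabled;

    WatchdogSketch(DynamicBooleanSetting enabledSetting) {
        enabledSetting.initializeAndWatch(this::setEnabled);
    }

    private void setEnabled(boolean newValue) {
        this.enabled = newValue;
    }

    boolean isEnabled() {
        return enabled;
    }
}
```

With this shape the setter is the only place that writes the field, so the initial read and the update path cannot drift apart.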


private void setCoordinatorThreshold(long newCoordinatorThresholdValue) {
this.coordinatorThresholdNanos = newCoordinatorThresholdValue;
this.minThresholdNanos = computeMinThreshold(newCoordinatorThresholdValue, dataNodeThresholdNanos);
Contributor

no need to pass variables: you already set this.coordinatorThresholdNanos here, and this.dataNodeThresholdNanos is a field as well, so you can just use both in computeMinThreshold()?

Contributor Author

I think it's more readable with parameters (i.e. it conveys what it does without having to step inside the method)

Contributor

haha, for me it was the opposite. As we're essentially testing this.coordinatorThresholdNanos and this.dataNodeThresholdNanos, but passing different variable names, I considered it harder to read. The name conveys to me what it does.

I don't have a strong preference though.
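
For readers weighing the two styles discussed in this thread, here is a minimal side-by-side sketch. `ThresholdsSketch` is a hypothetical class, and the body of `computeMinThreshold` (a plain `Math.min`) is an assumption, since the quoted diff does not show it.

```java
/** Hypothetical holder for the two thresholds and their cached minimum. */
final class ThresholdsSketch {
    private long coordinatorThresholdNanos;
    private long dataNodeThresholdNanos;
    private long minThresholdNanos;

    ThresholdsSketch(long coordinatorThresholdNanos, long dataNodeThresholdNanos) {
        this.coordinatorThresholdNanos = coordinatorThresholdNanos;
        this.dataNodeThresholdNanos = dataNodeThresholdNanos;
        this.minThresholdNanos = computeMinThreshold(coordinatorThresholdNanos, dataNodeThresholdNanos);
    }

    /** Parameter style, as in the quoted diff: the inputs are visible at the call site. Body is assumed. */
    static long computeMinThreshold(long coordinatorNanos, long dataNodeNanos) {
        return Math.min(coordinatorNanos, dataNodeNanos);
    }

    /** Field style, as suggested in review: shorter call, but only correct once the fields are assigned. */
    private long computeMinThresholdFromFields() {
        return Math.min(coordinatorThresholdNanos, dataNodeThresholdNanos);
    }

    void setCoordinatorThreshold(long newValue) {
        this.coordinatorThresholdNanos = newValue;
        this.minThresholdNanos = computeMinThreshold(newValue, dataNodeThresholdNanos);
    }

    void setDataNodeThreshold(long newValue) {
        this.dataNodeThresholdNanos = newValue; // must happen before the field-based call below
        this.minThresholdNanos = computeMinThresholdFromFields();
    }

    long minThresholdNanos() {
        return minThresholdNanos;
    }
}
```

Both setters produce the same result here. The practical difference is that the field style depends on statement order inside the setter, while the parameter style can be a static method with no hidden inputs.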

@andreidan andreidan requested a review from spinscale February 23, 2026 11:22
@elasticsearchmachine elasticsearchmachine added the serverless-linked Added by automation, don't add manually label Feb 23, 2026
@andreidan andreidan removed the serverless-linked Added by automation, don't add manually label Feb 23, 2026
@andreidan andreidan merged commit c377cee into elastic:main Feb 23, 2026
35 checks passed
szybia added a commit to szybia/elasticsearch that referenced this pull request Feb 23, 2026
…on-sliced-reindex

* upstream/main: (110 commits)
  Add search task watchdog to log hot threads on slow search (elastic#142746)
  Fix return_intermediate_results query param on get async search results (elastic#142875)
  Mute org.elasticsearch.compute.operator.exchange.BatchDriverTests testSinglePageSingleBatch elastic#142895
  Cancel reindex body always has status (elastic#142766)
  Fix built-in roles sync losing updates (elastic#142433)
  ESQL: Clarify docs and add csv test for WHERE in STATS (elastic#133629)
  Fix and unmute ReindexResumeIT (elastic#142788)
  Fix broken release notes
  Mute org.elasticsearch.benchmark.vector.scorer.VectorScorerOSQBenchmarkTests testSingleScalarVsVectorized {p0=384 p1=4 p2=NIO p3=COSINE} elastic#142883
  ES|QL: fix Generative tests for commands that don't change the output schema (elastic#142864)
  Mute org.elasticsearch.benchmark.vector.scorer.VectorScorerOSQBenchmarkTests testSingleScalarVsVectorized {p0=1024 p1=1 p2=NIO p3=DOT_PRODUCT} elastic#142881
  SQL: Fix QlIllegalArgumentException with non-foldable date range queries (elastic#142386)
  Add more errors to the allowed_errors with github issue links (elastic#142862)
  ESQL: reapply "NDJSON datasource" (elastic#142855)
  Add implementation to update service settings method for Alibaba Cloud Search service (elastic#142738)
  Mute org.elasticsearch.snapshots.SnapshotShutdownIT testStartRemoveNodeButDoNotComplete elastic#142871
  Mute org.elasticsearch.snapshots.SnapshotShutdownIT testDeleteSnapshotWithPausedShardSnapshots elastic#142870
  Mute org.elasticsearch.snapshots.SnapshotShutdownIT testAbortSnapshotWhileRemovingNode elastic#142869
  Mute org.elasticsearch.snapshots.SnapshotShutdownIT testRemoveNodeDuringSnapshot elastic#142868
  ES|QL: Guard exponential_histogram TO_STRING against too large inputs (elastic#140718)
  ...
jdconrad pushed a commit to jdconrad/elasticsearch that referenced this pull request Feb 24, 2026
…42746)
sidosera pushed a commit to sidosera/elasticsearch that referenced this pull request Feb 24, 2026
…42746)
Labels

>enhancement, :Search Foundations/Search, Team:Search Foundations, v9.4.0

4 participants