Add search task watchdog to log hot threads on slow search#142746
andreidan merged 17 commits into elastic:main
Introduces an opt-in watchdog that logs hot threads when search tasks exceed configurable time thresholds. Each node monitors its own tasks via TaskManager, avoiding cross-node coordination complexity.

Data nodes log when shard-level tasks (query/fetch) exceed threshold. Coordinators log only when the reduce/merge phase is slow, detected by checking that all child tasks have completed before logging.

This introduces the following settings:
- search.task_watchdog.enabled (default: false)
- search.task_watchdog.coordinator_threshold (default: 3s)
- search.task_watchdog.data_node_threshold (default: 3s)
- search.task_watchdog.interval (default: 1s)
- search.task_watchdog.cooldown_period (default: 30s)
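The decision logic described above can be sketched roughly as follows. This is a simplified model in plain Java, not the actual SearchTaskWatchdog code: `WatchdogSketch`, `TaskInfo`, and `shouldLogHotThreads` are hypothetical names, and the single node-wide cooldown and the strict `>` comparison are assumptions made for illustration.

```java
import java.util.List;

/** Hypothetical, simplified model of the per-node check; not the actual SearchTaskWatchdog. */
final class WatchdogSketch {

    /** Illustrative stand-in for a running search task tracked on this node. */
    static final class TaskInfo {
        final boolean coordinator;   // true for the coordinating task, false for a shard-level task
        final long startNanos;       // when the task started
        final int runningChildren;   // child tasks still in flight (only meaningful for coordinators)

        TaskInfo(boolean coordinator, long startNanos, int runningChildren) {
            this.coordinator = coordinator;
            this.startNanos = startNanos;
            this.runningChildren = runningChildren;
        }
    }

    private final long coordinatorThresholdNanos;
    private final long dataNodeThresholdNanos;
    private final long cooldownNanos;
    private boolean loggedBefore = false;
    private long lastLoggedNanos;

    WatchdogSketch(long coordinatorThresholdNanos, long dataNodeThresholdNanos, long cooldownNanos) {
        this.coordinatorThresholdNanos = coordinatorThresholdNanos;
        this.dataNodeThresholdNanos = dataNodeThresholdNanos;
        this.cooldownNanos = cooldownNanos;
    }

    /** Called once per check interval; returns true if hot threads should be logged now. */
    boolean shouldLogHotThreads(List<TaskInfo> tasks, long nowNanos) {
        // Assumption: one node-wide cooldown suppresses repeated logging.
        if (loggedBefore && nowNanos - lastLoggedNanos < cooldownNanos) {
            return false;
        }
        for (TaskInfo task : tasks) {
            long elapsed = nowNanos - task.startNanos;
            boolean slow;
            if (task.coordinator) {
                // Only treat the coordinator as slow once every child task has completed,
                // so that time spent waiting on shards is not attributed to reduce/merge.
                slow = elapsed > coordinatorThresholdNanos && task.runningChildren == 0;
            } else {
                slow = elapsed > dataNodeThresholdNanos;
            }
            if (slow) {
                loggedBefore = true;
                lastLoggedNanos = nowNanos;
                return true;
            }
        }
        return false;
    }
}
```

In this sketch a coordinator task that is still waiting on shard-level children never triggers logging on its own, which mirrors the intent that data nodes report slow query/fetch work and coordinators report only slow reduce/merge work.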
Pinging @elastic/es-search-foundations (Team:Search Foundations)
Hi @andreidan, I've created a changelog YAML for you.
leemthompo left a comment
Docs LGTM 👌. Made minor suggestions in the opening paragraphs to add a couple of links and to break a long sentence into two at the end, with a minor rewording for clarity.
docs/reference/elasticsearch/configuration-reference/search-settings.md
Outdated
Failure was #141734
spinscale left a comment
left a few minor comments, looking forward to the functionality!
this.taskManager = taskManager;
this.threadPool = threadPool;

this.enabled = ENABLED.get(settings);
aren't all of these called with the settings update consumer anyway, so no need to call twice?
I'm not sure I understand. What are we calling twice?
We initialize enabled here and then subscribe to changes a bit later in the constructor.
Perhaps using initializeAndWatch is what you meant here? d1a0d29
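To illustrate the difference under discussion, here is a minimal sketch of the "initialize and watch" pattern in plain Java. `DynamicFlag` and `WatchdogFlags` are hypothetical stand-ins written for this example, not Elasticsearch classes; the sketch only assumes that the method mentioned above applies the current value once and then registers the same consumer for later updates.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical stand-in for a dynamic boolean setting; not an Elasticsearch class. */
final class DynamicFlag {
    private boolean value;
    private final List<Consumer<Boolean>> consumers = new ArrayList<>();

    DynamicFlag(boolean initial) {
        this.value = initial;
    }

    boolean get() {
        return value;
    }

    /** Subscribe only: the consumer sees future updates, not the current value. */
    void addUpdateConsumer(Consumer<Boolean> consumer) {
        consumers.add(consumer);
    }

    /** Apply the current value immediately, then subscribe to future updates. */
    void initializeAndWatch(Consumer<Boolean> consumer) {
        consumer.accept(value);
        consumers.add(consumer);
    }

    /** Simulates a dynamic settings update reaching all subscribers. */
    void update(boolean newValue) {
        value = newValue;
        for (Consumer<Boolean> consumer : consumers) {
            consumer.accept(newValue);
        }
    }
}

/** Hypothetical holder showing a single call that both initializes and subscribes. */
final class WatchdogFlags {
    volatile boolean enabled;

    WatchdogFlags(DynamicFlag enabledSetting) {
        // Replaces a manual "enabled = setting.get()" followed by a separate subscription.
        enabledSetting.initializeAndWatch(v -> this.enabled = v);
    }
}
```

With this shape the field is never read from the settings in one place and subscribed in another, which is the duplication the review comment appears to point at.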
private void setCoordinatorThreshold(long newCoordinatorThresholdValue) {
    this.coordinatorThresholdNanos = newCoordinatorThresholdValue;
    this.minThresholdNanos = computeMinThreshold(newCoordinatorThresholdValue, dataNodeThresholdNanos);
No need to pass variables: you already set this.coordinatorThresholdNanos here, and this.dataNodeThresholdNanos is a field as well, so can you just use these in computeMinThreshold()?
I think it's more readable with parameters (i.e. it conveys what it does without having to step inside the method)
haha, for me it was the opposite. As we're essentially testing this.coordinatorThresholdNanos and this.dataNodeThresholdNanos, but passing different variable names, I considered it harder to read. The name conveys to me what it does.
I don't have a strong preference though.
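For readers comparing the two styles debated in this thread, here is a small side-by-side sketch. `ThresholdHolder` is a hypothetical class, and the body of `computeMinThreshold` is assumed to be a plain `Math.min`, since the PR excerpt does not show it.

```java
/** Hypothetical holder contrasting the two styles; not the actual SearchTaskWatchdog. */
final class ThresholdHolder {
    private long coordinatorThresholdNanos;
    private long dataNodeThresholdNanos;
    private long minThresholdNanos;

    ThresholdHolder(long coordinatorThresholdNanos, long dataNodeThresholdNanos) {
        this.coordinatorThresholdNanos = coordinatorThresholdNanos;
        this.dataNodeThresholdNanos = dataNodeThresholdNanos;
        this.minThresholdNanos = computeMinThreshold(coordinatorThresholdNanos, dataNodeThresholdNanos);
    }

    // Style 1 (parameterized): the call site shows which values feed the computation.
    void setCoordinatorThreshold(long newCoordinatorThresholdValue) {
        this.coordinatorThresholdNanos = newCoordinatorThresholdValue;
        this.minThresholdNanos = computeMinThreshold(newCoordinatorThresholdValue, dataNodeThresholdNanos);
    }

    // Assumption: the minimum of the two thresholds; the real method body is not shown in the PR excerpt.
    private static long computeMinThreshold(long coordinatorNanos, long dataNodeNanos) {
        return Math.min(coordinatorNanos, dataNodeNanos);
    }

    // Style 2 (field-based): update the field first, then recompute from the current state.
    void setDataNodeThreshold(long newDataNodeThresholdValue) {
        this.dataNodeThresholdNanos = newDataNodeThresholdValue;
        this.minThresholdNanos = computeMinThresholdFromFields();
    }

    private long computeMinThresholdFromFields() {
        return Math.min(coordinatorThresholdNanos, dataNodeThresholdNanos);
    }

    long minThresholdNanos() {
        return minThresholdNanos;
    }
}
```

Both styles produce the same result as long as the field is assigned before the field-based helper runs; the parameterized version is a pure function, while the field-based version keeps call sites shorter.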
server/src/main/java/org/elasticsearch/action/search/SearchTaskWatchdog.java