Send max of two types of max queue latency to ClusterInfo#132675
Send max of two types of max queue latency to ClusterInfo#132675elasticsearchmachine merged 15 commits intoelastic:mainfrom
Conversation
The TransportNodeUsageStatsForThreadPoolsAction now takes the max latency of any task currently queued in the write thread pool queue AND the previously collected max queue latency of any task dequeued since the last call. This covers the possibility that queue times can rise greatly before being reflected in execution: imagine all the write threads are stalled. Adds additional IT testing to exercise both forms of queue latency, a followup for ES-12316.
|
Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination) |
|
Hi @DiannaHohensee, I've created a changelog YAML for you. |
| * that is returned, which waits for total-write-threads + 1 callers. The caller can release the tasks by calling | ||
| * {@code barrier.await()} or interrupt them with {@code barrier.reset()}. | ||
| */ | ||
| public CyclicBarrier blockDataNodeIndexing(String dataNodeName) { |
There was a problem hiding this comment.
I stole this from some serverless testing.
| ); | ||
| } | ||
|
|
||
| public static void waitForTimeToElapse(long elapsedMillis) throws InterruptedException { |
There was a problem hiding this comment.
Pulling this from another test file, so it can be reused.
There was a problem hiding this comment.
I think we do not need to reuse this - so prefer to keep it local instead. Ideally we'd not have such unqualified waits, but we can look at that separately.
henningandersen
left a comment
There was a problem hiding this comment.
Left a number of comments, otherwise looks good.
I wonder if we really need this in a first version, but now it is here, it is fine to get in.
.../org/elasticsearch/common/util/concurrent/TaskExecutionTimeTrackingEsThreadPoolExecutor.java
Show resolved
Hide resolved
server/src/internalClusterTest/java/org/elasticsearch/cluster/ClusterInfoServiceIT.java
Outdated
Show resolved
Hide resolved
server/src/internalClusterTest/java/org/elasticsearch/cluster/ClusterInfoServiceIT.java
Outdated
Show resolved
Hide resolved
server/src/internalClusterTest/java/org/elasticsearch/cluster/ClusterInfoServiceIT.java
Outdated
Show resolved
Hide resolved
server/src/internalClusterTest/java/org/elasticsearch/cluster/ClusterInfoServiceIT.java
Outdated
Show resolved
Hide resolved
server/src/internalClusterTest/java/org/elasticsearch/cluster/ClusterInfoServiceIT.java
Outdated
Show resolved
Hide resolved
server/src/internalClusterTest/java/org/elasticsearch/cluster/ClusterInfoServiceIT.java
Outdated
Show resolved
Hide resolved
server/src/internalClusterTest/java/org/elasticsearch/cluster/ClusterInfoServiceIT.java
Outdated
Show resolved
Hide resolved
server/src/internalClusterTest/java/org/elasticsearch/cluster/ClusterInfoServiceIT.java
Outdated
Show resolved
Hide resolved
| ); | ||
| } | ||
|
|
||
| public static void waitForTimeToElapse(long elapsedMillis) throws InterruptedException { |
There was a problem hiding this comment.
I think we do not need to reuse this - so prefer to keep it local instead. Ideally we'd not have such unqualified waits, but we can look at that separately.
…e write thread pool: eliminate replicas requiring sync
DiannaHohensee
left a comment
There was a problem hiding this comment.
Thanks for the review, I've addressed the feedback.
I also realized that the peek() method was returning Nanos, not Millis, so I fixed that (a23c716).
server/src/internalClusterTest/java/org/elasticsearch/cluster/ClusterInfoServiceIT.java
Outdated
Show resolved
Hide resolved
server/src/internalClusterTest/java/org/elasticsearch/cluster/ClusterInfoServiceIT.java
Outdated
Show resolved
Hide resolved
server/src/internalClusterTest/java/org/elasticsearch/cluster/ClusterInfoServiceIT.java
Outdated
Show resolved
Hide resolved
| ); | ||
| } | ||
|
|
||
| public static void waitForTimeToElapse(long elapsedMillis) throws InterruptedException { |
server/src/internalClusterTest/java/org/elasticsearch/cluster/ClusterInfoServiceIT.java
Outdated
Show resolved
Hide resolved
server/src/internalClusterTest/java/org/elasticsearch/cluster/ClusterInfoServiceIT.java
Outdated
Show resolved
Hide resolved
server/src/internalClusterTest/java/org/elasticsearch/cluster/ClusterInfoServiceIT.java
Outdated
Show resolved
Hide resolved
server/src/internalClusterTest/java/org/elasticsearch/cluster/ClusterInfoServiceIT.java
Outdated
Show resolved
Hide resolved
server/src/internalClusterTest/java/org/elasticsearch/cluster/ClusterInfoServiceIT.java
Outdated
Show resolved
Hide resolved
.../org/elasticsearch/common/util/concurrent/TaskExecutionTimeTrackingEsThreadPoolExecutor.java
Show resolved
Hide resolved
…must wait for 1ms, not 1ns, to see a change in queue latency
…2675) The TransportNodeUsageStatsForThreadPoolsAction now takes the max latency of any task currently queued in the write thread pool queue AND the previously collected max queue latency of any task dequeued since the last call. This covers the possibility that queue times can rise greatly before being reflected in execution: imagine all the write threads are stalled or have long running tasks. This action feeds a max queue latency stat to the ClusterInfo. Follow up from ES-12233. Adds additional IT testing to exercise both forms of queue latency, a followup for ES-12316. ------------- Completing the follow up testing [Henning requested previously](https://github.com/elastic/elasticsearch/pull/131480/files#r2243610807).
The TransportNodeUsageStatsForThreadPoolsAction now
takes the max latency of any task currently queued
in the write thread pool queue AND the previously
collected max queue latency of any task dequeued
since the last call. This covers the possibility
that queue times can rise greatly before being
reflected in execution: imagine all the write
threads are stalled or have long running tasks.
This action feeds a max queue latency stat to
the ClusterInfo. Follow up from ES-12233.
Adds additional IT testing to exercise both
forms of queue latency, a followup for ES-12316.
Completing the follow up testing Henning requested previously.