Include clusterApplyListener in long cluster apply warnings #120087

nicktindall merged 6 commits into elastic:main
Conversation
public long rawRelativeTimeInMillis() {
    assertThat(Thread.currentThread().getName(), containsString(ClusterApplierService.CLUSTER_UPDATE_THREAD_NAME));
    return currentTimeMillis;
}
I notice that we use rawRelativeTimeInMillis in the Recorder, but the threshold is on the order of seconds. I wonder if we could use relativeTimeInMillis instead (to reduce the calls to System.nanoTime()).
I guess if the cluster is under heavy GC or the cachedTimer thread is experiencing starvation, the elapsed calculation will be off if we use the cached timer. This can be detected by the log message "timer thread slept ...", but it's a bit indirect. Not sure if that was the intention behind using rawRelativeTimeInMillis, or maybe I am just over-explaining it. I noticed we use it in places like MasterService, the In/OutBoundHandlers and HttpTransport etc., but not in places such as PersistedClusterStateService or FsHealthService. Not sure whether those are intentional or accidental either.
The cached timer only updates every 200ms or so, and although the total threshold is many seconds in length, many of the individual steps we record should take much less than 200ms. It's not a huge deal to call System::nanoTime here.
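
As an aside, a minimal sketch of the trade-off under discussion; CachedClock and its refresh cadence are hypothetical stand-ins, not the real ThreadPool implementation:

// Hypothetical sketch: a cached clock refreshed by a timer thread every ~200ms.
// Steps shorter than the refresh interval may measure as 0ms against the cached
// value, which is why the Recorder reads the raw clock for per-step timings.
final class CachedClock {
    private volatile long cachedMillis = System.nanoTime() / 1_000_000;

    // called by a background timer thread roughly every 200ms
    void refresh() {
        cachedMillis = System.nanoTime() / 1_000_000;
    }

    long relativeTimeInMillis() {      // cheap read, up to ~200ms stale
        return cachedMillis;
    }

    long rawRelativeTimeInMillis() {   // exact, but pays for System.nanoTime()
        return System.nanoTime() / 1_000_000;
    }
}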
        newClusterState = updateFunction.apply(previousClusterState);
    }
} catch (Exception e) {
    timedListener.onFailure(e);
If the listener throws, the following code will not run. That probably shouldn't happen in practice, but maybe it's still worthwhile to wrap this in a try/finally?

I also wonder whether we should add more detail to the log message about the time spent applying the cluster state and calling the listener?
The TimedListener will measure the time spent in the listener, e.g. from the tests:

cluster state applier task [test4] took [36s] which is above the warn threshold of [30s]: [running task [test4]] took [0ms], [listener.onResponse] took [36000ms]

Is that what you meant by "time spent ... calling the listener"?

I will look at using one of the ActionListener.... utils or base classes to make exception handling more robust.
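
For illustration, a hedged sketch of a recorder that could produce that breakdown; only record(name) returning a Releasable is visible in this PR, the rest of the shape is assumed:

import java.util.ArrayList;
import java.util.List;
import java.util.function.LongSupplier;

import org.elasticsearch.core.Releasable;

// Sketch only: each record(name) call starts a named step and the returned
// Releasable stops it, so the warning can list "[step] took [Nms]" entries.
final class Recorder {
    private final List<String> steps = new ArrayList<>();
    private final LongSupplier currentTimeMillis; // e.g. threadPool::rawRelativeTimeInMillis

    Recorder(LongSupplier currentTimeMillis) {
        this.currentTimeMillis = currentTimeMillis;
    }

    Releasable record(String name) {
        final long startMillis = currentTimeMillis.getAsLong();
        return () -> steps.add("[" + name + "] took [" + (currentTimeMillis.getAsLong() - startMillis) + "ms]");
    }

    String getRecordings() {
        return String.join(", ", steps);
    }
}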
I think Yang is right, we just need a try/finally here. TBH it's an error for this listener to throw, we should probably assert that too.
Actually no real need to propagate the exception to the caller either, it just bubbles up to the unhandled exception handler which logs it and drops it. We may as well catch and log (and assert) in TimedListener.
The ClusterApplyActionListener already wraps the provided listener to prevent onFailure exceptions from propagating, but I added similar logic to TimedListener#onFailure just in case that changes and we end up with a less safe listener being passed in.
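
For context, a hedged sketch of how such a timed wrapper might look, reusing the hypothetical Recorder above; the field names and defensive branch are assumptions based on this thread, not the PR's exact code:

import org.elasticsearch.action.ActionListener;
import org.elasticsearch.core.Releasable;

// Sketch only: wrap the applier listener so that each notification shows up
// as a named step in the long-apply warning, and defend against a throwing
// onFailure as discussed above.
final class TimedListener implements ActionListener<Void> {
    private final ActionListener<Void> listener;
    private final Recorder recorder;

    TimedListener(ActionListener<Void> listener, Recorder recorder) {
        this.listener = listener;
        this.recorder = recorder;
    }

    @Override
    public void onResponse(Void response) {
        try (Releasable ignored = recorder.record("listener.onResponse")) {
            listener.onResponse(response);
        }
    }

    @Override
    public void onFailure(Exception e) {
        try (Releasable ignored = recorder.record("listener.onFailure")) {
            listener.onFailure(e);
        } catch (Exception inner) {
            inner.addSuppressed(e);
            assert false : inner; // this listener is not expected to throw; see the suppression note below
        }
    }
}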
Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)
Hi @nicktindall, I've created a changelog YAML for you.
try (Releasable ignored = recorder.record("listener.onFailure")) {
    listener.onFailure(e);
} catch (Exception inner) {
    inner.addSuppressed(e);
I think generally we'd suppress the second-thrown exception and propagate the first.
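
A hedged sketch of the suggested ordering, combined with the catch-and-assert handling discussed earlier in the thread; the logger field and message are illustrative, not taken from the PR:

@Override
public void onFailure(Exception e) {
    try (Releasable ignored = recorder.record("listener.onFailure")) {
        listener.onFailure(e);
    } catch (Exception inner) {
        e.addSuppressed(inner); // suppress the second-thrown exception...
        logger.error("cluster apply listener threw while handling failure", e); // ...and keep the first primary
        assert false : e;
    }
}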
#120087 moved the assertion to the beginning of the catch block, before the warning log message. To ease future development, this PR moves the assertion back, so the warning will be available in the logs if a dev unexpectedly runs into this assertion.
Expands the long cluster apply warnings to include execution of clusterApplyListener.(onResponse|onFailure).

Relates: ES-10249