Fix readiness edge case on startup #140791
Conversation
The readiness service watches for cluster state updates to determine when the node is ready. It must have a master node elected, and file settings must have been applied. Normally those take a little bit of time. However, the readiness service is set up to not even allow starting the tcp listener until after the service start method is called. This occurs in Node.start(), but _after_ the initial node join. If the initial join occurs before start, and the state already reflects the "ready" state, then when start is called the service will be marked active, but nothing will ever start the listener. This commit adjusts the readiness service to keep track of the last state applied. It also moves adding the cluster state listener to when the service is started. Finally, it adjusts the readiness test helpers to use assertBusy instead of a hand-rolled backoff loop. closes elastic#136955
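For illustration only, here is a minimal sketch of the shape of the fix described above. Class and method names are simplified and hypothetical (the real ReadinessService extends AbstractLifecycleComponent and has more moving parts); the point is the pattern: register the cluster state listener in doStart() and immediately re-evaluate the already-applied state, so a join that happened before start() is not missed.

```java
import org.elasticsearch.cluster.ClusterChangedEvent;
import org.elasticsearch.cluster.ClusterState;
import org.elasticsearch.cluster.ClusterStateListener;
import org.elasticsearch.cluster.service.ClusterService;

// Hypothetical sketch, not the actual ReadinessService source.
class ReadinessSketch implements ClusterStateListener {

    private final ClusterService clusterService;
    private volatile ClusterState lastState; // last cluster state this service evaluated

    ReadinessSketch(ClusterService clusterService) {
        this.clusterService = clusterService;
        // Note: the cluster state listener is deliberately no longer
        // registered here in the constructor.
    }

    // Called from Node.start(), possibly after the node has already joined the cluster.
    void doStart() {
        clusterService.addListener(this);
        // Re-evaluate whatever state has already been applied, so a join that
        // happened before start() still ends up starting the tcp listener.
        evaluate(clusterService.state());
    }

    @Override
    public void clusterChanged(ClusterChangedEvent event) {
        evaluate(event.state());
    }

    private void evaluate(ClusterState state) {
        lastState = state;
        // If a master is elected and file settings have been applied,
        // open the readiness tcp socket (details elided).
    }
}
```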
Pinging @elastic/es-core-infra (Team:Core/Infra)

Hi @rjernst, I've created a changelog YAML for you.
prdoyle
left a comment
Ok I take your word for it, but I don't entirely follow the PR description.
> If the initial join occurs before start, and the state already reflects the "ready" state, when start is called the service will be marked active, but nothing will ever start the listener.
The listener used to be set up at the end of the ReadinessService constructor. I don't understand why that's not sufficient?
```java
);
clusterService = mock(ClusterService.class);
when(clusterService.lifecycleState()).thenReturn(Lifecycle.State.STARTED);
when(clusterService.state()).thenReturn(emptyState());
```
(Aw, a real ClusterService replaced with a mock. Oh well, I see why you did it.)
```diff
-        throw new AssertionError("Readiness socket should be open");
     public static void tcpReadinessProbeTrue(ReadinessService readinessService) throws Exception {
+        assertBusy(() -> assertTrue("Readiness socket should be open", socketIsOpen(readinessService)));
```
This is equivalent, right? Cool!
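For context, a rough before/after of the helper (the loop below is a hypothetical reconstruction of a hand-rolled backoff, not the removed code): ESTestCase.assertBusy retries the supplied block with increasing backoff until the assertion stops throwing, or rethrows the last failure after a default 10-second timeout.

```java
// Before (hypothetical reconstruction of a hand-rolled backoff loop):
for (int i = 0; i < 30; i++) {
    if (socketIsOpen(readinessService)) {
        return;
    }
    Thread.sleep(100);
}
throw new AssertionError("Readiness socket should be open");

// After: assertBusy keeps re-running the lambda until the assertion
// passes, failing only once the timeout is exhausted.
assertBusy(() -> assertTrue("Readiness socket should be open", socketIsOpen(readinessService)));
```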
```java
assertTrue(readinessService.ready());

readinessService.stop();
readinessService.close();
```
Is it ok that these won't run if the assertion fails? Usually this kind of stuff is done with @After.
The test is kind of fubar if the assertion fails: there's no guarantee stop/close won't themselves throw (for example, from assertions when transitioning lifecycle states), and the other tests are currently written this way. Yes, they could leak a thread, which will cause the class to fail, but the test has already failed at this point.
I reworked these tests to rely on @After for stopping/closing the service, if necessary.
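A sketch of that @After pattern, assuming the test class keeps the service in a field (names hypothetical):

```java
import org.junit.After;

private ReadinessService readinessService;

@After
public void cleanupReadinessService() {
    // Runs even when a test body's assertion fails, so the service's
    // lifecycle and socket are always torn down and no threads leak.
    if (readinessService != null) {
        readinessService.stop();
        readinessService.close();
    }
}
```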
```diff
@@ -156,6 +159,11 @@ ServerSocketChannel setupSocket() {
     protected void doStart() {
         // Mark the service as active, we'll start the listener when ES is ready
```
Is this comment still true?
Yes, this is still true. The listener here is the tcp listener, not the cluster state listener.

There is ambiguous terminology used throughout this class, unfortunately. There are two "listeners": the cluster state listener and the tcp "listener". The last "start the listener" in my comment was talking about the tcp listener. I'll adjust the description to make this more clear.
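To spell out the two meanings (illustrative snippet only, not from the source):

```java
// 1) The cluster state listener: a callback invoked on each cluster state update.
ClusterStateListener clusterStateListener = event -> { /* re-check readiness */ };

// 2) The tcp "listener": the server socket backing the readiness probe
//    (a java.nio.channels.ServerSocketChannel, as returned by setupSocket()
//    in the diff above); "start the listener" means opening and binding it.
ServerSocketChannel tcpListener;
```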
💔 Backport failed
You can use sqren/backport to manually backport by running
The readiness service watches for cluster state updates to determine when the node is ready. It must have a master node elected, and file settings must have been applied. Normally those take a little bit of time. However, the readiness service is set up to not even allow starting the tcp listener until after the service start method is called. This occurs in Node.start(), but after the initial node join. If the initial join occurs before `ReadinessService.start()`, and the cluster state already reflects the "ready" state, when `ReadinessService.start()` is called the service will be marked active, but no future cluster state update will reflect a change from "not ready" to "ready", so the tcp listener will never be started.

This commit adjusts the readiness service to keep track of the last state applied. It also moves adding the cluster state listener to when the service is started. Finally, it adjusts the readiness test helpers to use assertBusy instead of a hand-rolled backoff loop.
closes #136955