
Fix readiness edge case on startup#140791

Merged
rjernst merged 7 commits into elastic:main from rjernst:readiness/test_wait
Jan 21, 2026

Conversation

@rjernst (Member) commented Jan 16, 2026

The readiness service watches for cluster state updates to determine when the node is ready: a master node must be elected, and file settings must have been applied. Normally those take a little bit of time. However, the readiness service is set up to not even allow starting the TCP listener until after the service's start method is called. This occurs in Node.start(), but after the initial node join. If the initial join occurs before ReadinessService.start(), and the cluster state already reflects the "ready" state, then when ReadinessService.start() is called the service will be marked active, but no future cluster state update will reflect a change from "not ready" to "ready", so the TCP listener will never be started.

This commit adjusts the readiness service to keep track of the last cluster state applied. It also moves registration of the cluster state listener to when the service is started. Finally, it adjusts the readiness test helpers to use assertBusy instead of a hand-rolled backoff loop.

closes #136955
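The race described above can be modeled in a few lines. This is a deliberately simplified sketch with hypothetical names, not the actual ReadinessService code; it only illustrates why remembering the last applied state closes the gap between the initial join and start():

```java
// Minimal model of the startup race described in this PR. Names and
// structure are illustrative, not the real Elasticsearch implementation.
public class ReadinessSketch {
    private volatile boolean active = false;          // set by start()
    private volatile boolean listenerRunning = false; // models the TCP listener
    private volatile boolean lastSeenReady = false;   // the fix: remember the last applied state

    // Called on every cluster state update (master elected + file settings applied).
    void clusterChanged(boolean ready) {
        lastSeenReady = ready;
        if (active && ready && listenerRunning == false) {
            listenerRunning = true; // start the TCP listener
        }
    }

    // Called from Node.start(), possibly AFTER the initial join already
    // delivered a "ready" cluster state.
    void start() {
        active = true;
        // Without this check, a "ready" state applied before start() would
        // never be revisited and the TCP listener would never open.
        if (lastSeenReady && listenerRunning == false) {
            listenerRunning = true;
        }
    }

    boolean listening() {
        return listenerRunning;
    }

    public static void main(String[] args) {
        // Reproduce the edge case: the "ready" state arrives before start().
        ReadinessSketch service = new ReadinessSketch();
        service.clusterChanged(true); // initial join already reflects "ready"
        service.start();
        if (service.listening() == false) {
            throw new AssertionError("TCP listener should have been started");
        }
        System.out.println("listener open: " + service.listening());
    }
}
```

Dropping the lastSeenReady check in start() reproduces the bug: the listener stays closed because no later cluster state transitions from "not ready" to "ready".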

@rjernst rjernst added >bug :Core/Infra/Node Lifecycle Node startup, bootstrapping, and shutdown auto-backport Automatically create backport pull requests when merged branch:9.2 branch:9.1 branch:8.19 branch:9.3 labels Jan 16, 2026
@rjernst rjernst requested a review from prdoyle January 16, 2026 01:04
@elasticsearchmachine elasticsearchmachine added Team:Core/Infra Meta label for core/infra team v9.4.0 labels Jan 16, 2026
@elasticsearchmachine (Collaborator)

Pinging @elastic/es-core-infra (Team:Core/Infra)

@elasticsearchmachine (Collaborator)

Hi @rjernst, I've created a changelog YAML for you.

@prdoyle (Contributor) left a comment:

Ok I take your word for it, but I don't entirely follow the PR description.

If the initial join occurs before start, and the state already reflects the "ready" state, when start is called the service will be marked active, but nothing will ever start the listener.

The listener used to be set up at the end of the ReadinessService constructor. I don't understand why that's not sufficient?

);
clusterService = mock(ClusterService.class);
when(clusterService.lifecycleState()).thenReturn(Lifecycle.State.STARTED);
when(clusterService.state()).thenReturn(emptyState());
Contributor:
(Aw, a real ClusterService replaced with a mock. Oh well, I see why you did it.)


throw new AssertionError("Readiness socket should be open");
public static void tcpReadinessProbeTrue(ReadinessService readinessService) throws Exception {
assertBusy(() -> assertTrue("Readiness socket should be open", socketIsOpen(readinessService)));
Contributor:
This is equivalent, right? Cool!
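For readers unfamiliar with the helper: ESTestCase's assertBusy retries an assertion until it passes or a timeout expires, which is exactly what the hand-rolled backoff loop was doing. The following is a stripped-down sketch of that idea, not the real ESTestCase implementation (which takes a CheckedRunnable, uses growing backoff, and rethrows the last AssertionError):

```java
import java.util.function.BooleanSupplier;

// Stripped-down sketch of an assertBusy-style helper: retry a check with a
// short sleep until it passes or a timeout elapses.
public class BusyAssert {
    static void assertBusy(BooleanSupplier check, long timeoutMillis) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (true) {
            if (check.getAsBoolean()) {
                return; // condition satisfied
            }
            if (System.currentTimeMillis() >= deadline) {
                throw new AssertionError("condition not met within " + timeoutMillis + "ms");
            }
            Thread.sleep(10); // fixed backoff for the sketch
        }
    }

    public static void main(String[] args) throws InterruptedException {
        long start = System.currentTimeMillis();
        // Condition becomes true after ~50ms, like a socket that opens asynchronously.
        assertBusy(() -> System.currentTimeMillis() - start > 50, 1000);
        System.out.println("condition met");
    }
}
```

The test-helper change in this PR replaces the bespoke loop with the shared helper so all readiness probes time out and report failures the same way.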

assertTrue(readinessService.ready());

readinessService.stop();
readinessService.close();
Contributor:
Is it ok that these won't run if the assertion fails? Usually this kind of stuff is done with @After.

Member Author:
The test is kind of fubar if the assertion fails: there's no guarantee stop/close won't themselves throw (for example, from assertions in transitioning lifecycle states), and the other tests are currently written this way. Yes, they could leak a thread, which will cause the class to fail, but the test already failed at that point.

Member Author:
I reworked these tests to rely on @After for stopping/closing the service, if necessary.
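The guarantee @After provides is simply "run this teardown even if the test body threw", which in plain Java is a try/finally around the body. A small self-contained model of that guarantee (illustrative names, no JUnit dependency):

```java
// Plain-Java model of what JUnit's @After guarantees: teardown runs even
// when the test body throws, so the failure propagates AND cleanup happens.
public class AfterSketch {
    static boolean closed = false;

    static void testBody() {
        // Simulates an assertion failing mid-test, before stop()/close().
        throw new AssertionError("test failed mid-way");
    }

    static void tearDown() {
        closed = true; // stop()/close() the service regardless of test outcome
    }

    public static void main(String[] args) {
        try {
            try {
                testBody();
            } finally {
                tearDown(); // what JUnit does for each @After method
            }
        } catch (AssertionError expected) {
            // the failure still surfaces, but cleanup has already run
        }
        if (closed == false) {
            throw new AssertionError("teardown must have run");
        }
        System.out.println("closed = " + closed);
    }
}
```

Moving stop()/close() into @After means a failing readiness assertion can no longer leak the service's listener thread into later tests.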

@@ -156,6 +159,11 @@ ServerSocketChannel setupSocket() {
protected void doStart() {
// Mark the service as active, we'll start the listener when ES is ready
Contributor:
Is this comment still true?

Member Author:
Yes this is still true. The listener here is the tcp listener, not the cluster state listener.

@rjernst (Member Author) commented Jan 20, 2026

I don't entirely follow the PR description.

If the initial join occurs before start, and the state already reflects the "ready" state, when start is called the service will be marked active, but nothing will ever start the listener.

The listener used to be set up at the end of the ReadinessService constructor. I don't understand why that's not sufficient?

There is ambiguous terminology used throughout this class, unfortunately. There are two "listeners", the cluster state listener, and the tcp "listener". The last "start the listener" in my comment was talking about the tcp listener. I'll adjust the description to make this more clear.

@rjernst rjernst enabled auto-merge (squash) January 21, 2026 15:36
@rjernst rjernst merged commit 676e466 into elastic:main Jan 21, 2026
36 checks passed
rjernst added a commit to rjernst/elasticsearch that referenced this pull request Jan 21, 2026
@elasticsearchmachine (Collaborator)
💔 Backport failed

Branch  Result
9.3
9.1     Commit could not be cherrypicked due to conflicts
8.19    Commit could not be cherrypicked due to conflicts
9.2     Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 140791

szybia added a commit to szybia/elasticsearch that referenced this pull request Jan 21, 2026
…-tests

* upstream/main: (104 commits)
  Partition time-series source (elastic#140475)
  Mute org.elasticsearch.xpack.esql.heap_attack.HeapAttackSubqueryIT testManyRandomKeywordFieldsInSubqueryIntermediateResultsWithSortManyFields elastic#141083
  Reindex relocation: skip nodes marked for shutdown (elastic#141044)
  Make fails on fixture caching not fail image building (elastic#140959)
  Add multi-project tests for get and list reindex (elastic#140980)
  Painless docs overhaul (reference) (elastic#137211)
  Panama vector implementation of codePointCount (elastic#140693)
  Enable PromQL in release builds (elastic#140808)
  Update rest-api-spec for Jina embedding task (elastic#140696)
  [CI] ShardSearchPhaseAPMMetricsTests testUniformCanMatchMetricAttributesWhenPlentyOfDocumentsInIndex failed (elastic#140848)
  Combine hash computation with bloom filter writes/reads (elastic#140969)
  Refactor posting iterators to provide more information (elastic#141058)
  Wait for cluster to recover to yellow before checking index health (elastic#141057) (elastic#141065)
  Fix repo analysis read count assertions (elastic#140994)
  Fixed a bug in logsdb rolling upgrade sereverless tests involving par… (elastic#141022)
  Fix readiness edge case on startup (elastic#140791)
  PromQL: fix quantile function (elastic#141033)
  ignore `mmr` command for check (in development) (elastic#140981)
  Use Double.compare to compare doubles in tdigest.Sort (elastic#141049)
  Migrate third party module tests using legacy test clusters framework (elastic#140991)
  ...
elasticsearchmachine pushed a commit that referenced this pull request Jan 23, 2026

Labels

auto-backport Automatically create backport pull requests when merged backport pending >bug :Core/Infra/Node Lifecycle Node startup, bootstrapping, and shutdown Team:Core/Infra Meta label for core/infra team v8.19.11 v9.1.11 v9.2.5 v9.3.1 v9.4.0


Development

Successfully merging this pull request may close these issues.

[CI] ReadinessClusterIT testReadinessDuringRestartsNormalOrder failing

3 participants