
Fix readiness edge case on startup#140791

Merged
rjernst merged 7 commits into elastic:main from rjernst:readiness/test_wait
Jan 21, 2026

Conversation

@rjernst (Member) commented Jan 16, 2026

The readiness service watches for cluster state updates to determine when the node is ready: a master node must be elected, and file settings must have been applied. Normally those take a little bit of time. However, the readiness service is set up to not even allow starting the TCP listener until after the service's start method is called. This occurs in Node.start(), but after the initial node join. If the initial join occurs before ReadinessService.start(), and the cluster state already reflects the "ready" state, then when ReadinessService.start() is called the service will be marked active, but no future cluster state update will reflect a change from "not ready" to "ready", so the TCP listener will never be started.

This commit adjusts the readiness service to keep track of the last cluster state applied. It also moves registration of the cluster state listener to when the service is started. Finally, it adjusts the readiness test helpers to use assertBusy instead of a hand-rolled backoff loop.

closes #136955
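The race described above can be modeled in a few lines. This is a deliberately simplified sketch with hypothetical names, not the actual ReadinessService code; it only illustrates why remembering the last applied state closes the gap between the initial join and start():

```java
// Minimal model of the startup race described in this PR. Names and
// structure are illustrative, not the real Elasticsearch implementation.
public class ReadinessSketch {
    private volatile boolean active = false;          // set by start()
    private volatile boolean listenerRunning = false; // models the TCP listener
    private volatile boolean lastSeenReady = false;   // the fix: remember the last applied state

    // Called on every cluster state update (master elected + file settings applied).
    void clusterChanged(boolean ready) {
        lastSeenReady = ready;
        if (active && ready && listenerRunning == false) {
            listenerRunning = true; // start the TCP listener
        }
    }

    // Called from Node.start(), possibly AFTER the initial join already
    // delivered a "ready" cluster state.
    void start() {
        active = true;
        // Without this check, a "ready" state applied before start() would
        // never be revisited and the TCP listener would never open.
        if (lastSeenReady && listenerRunning == false) {
            listenerRunning = true;
        }
    }

    boolean listening() {
        return listenerRunning;
    }

    public static void main(String[] args) {
        // Reproduce the edge case: the "ready" state arrives before start().
        ReadinessSketch service = new ReadinessSketch();
        service.clusterChanged(true); // initial join already reflects "ready"
        service.start();
        if (service.listening() == false) {
            throw new AssertionError("TCP listener should have been started");
        }
        System.out.println("listener open: " + service.listening());
    }
}
```

Dropping the lastSeenReady check in start() reproduces the bug: the listener stays closed because no later cluster state transitions from "not ready" to "ready".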

@rjernst rjernst added >bug :Core/Infra/Node Lifecycle Node startup, bootstrapping, and shutdown auto-backport Automatically create backport pull requests when merged branch:9.2 branch:9.1 branch:8.19 branch:9.3 labels Jan 16, 2026
@rjernst rjernst requested a review from prdoyle January 16, 2026 01:04
@elasticsearchmachine elasticsearchmachine added Team:Core/Infra Meta label for core/infra team v9.4.0 labels Jan 16, 2026
@elasticsearchmachine (Collaborator)

Pinging @elastic/es-core-infra (Team:Core/Infra)

@elasticsearchmachine (Collaborator)

Hi @rjernst, I've created a changelog YAML for you.

@prdoyle (Contributor) left a comment:

Ok I take your word for it, but I don't entirely follow the PR description.

If the initial join occurs before start, and the state already reflects the "ready" state, when start is called the service will be marked active, but nothing will ever start the listener.

The listener used to be set up at the end of the ReadinessService constructor. I don't understand why that's not sufficient?

);
clusterService = mock(ClusterService.class);
when(clusterService.lifecycleState()).thenReturn(Lifecycle.State.STARTED);
when(clusterService.state()).thenReturn(emptyState());
Contributor:
(Aw, a real ClusterService replaced with a mock. Oh well, I see why you did it.)


throw new AssertionError("Readiness socket should be open");
public static void tcpReadinessProbeTrue(ReadinessService readinessService) throws Exception {
assertBusy(() -> assertTrue("Readiness socket should be open", socketIsOpen(readinessService)));
Contributor:
This is equivalent, right? Cool!
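For readers unfamiliar with the helper: ESTestCase's assertBusy retries an assertion until it passes or a timeout expires, which is exactly what the hand-rolled backoff loop was doing. The following is a stripped-down sketch of that idea, not the real ESTestCase implementation (which takes a CheckedRunnable, uses growing backoff, and rethrows the last AssertionError):

```java
import java.util.function.BooleanSupplier;

// Stripped-down sketch of an assertBusy-style helper: retry a check with a
// short sleep until it passes or a timeout elapses.
public class BusyAssert {
    static void assertBusy(BooleanSupplier check, long timeoutMillis) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (true) {
            if (check.getAsBoolean()) {
                return; // condition satisfied
            }
            if (System.currentTimeMillis() >= deadline) {
                throw new AssertionError("condition not met within " + timeoutMillis + "ms");
            }
            Thread.sleep(10); // fixed backoff for the sketch
        }
    }

    public static void main(String[] args) throws InterruptedException {
        long start = System.currentTimeMillis();
        // Condition becomes true after ~50ms, like a socket that opens asynchronously.
        assertBusy(() -> System.currentTimeMillis() - start > 50, 1000);
        System.out.println("condition met");
    }
}
```

The test-helper change in this PR replaces the bespoke loop with the shared helper so all readiness probes time out and report failures the same way.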

assertTrue(readinessService.ready());

readinessService.stop();
readinessService.close();
Contributor:
Is it ok that these won't run if the assertion fails? Usually this kind of stuff is done with @After.

Member Author:
The test is kind of fubar if the assertion fails: there's no guarantee stop/close won't themselves throw (for example, from assertions in transitioning lifecycle states), and the other tests are currently written this way. Yes, they could leak a thread, which will cause the class to fail, but the test already failed at that point.

Member Author:
I reworked these tests to rely on @After for stopping/closing the service, if necessary.
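The guarantee @After provides is simply "run this teardown even if the test body threw", which in plain Java is a try/finally around the body. A small self-contained model of that guarantee (illustrative names, no JUnit dependency):

```java
// Plain-Java model of what JUnit's @After guarantees: teardown runs even
// when the test body throws, so the failure propagates AND cleanup happens.
public class AfterSketch {
    static boolean closed = false;

    static void testBody() {
        // Simulates an assertion failing mid-test, before stop()/close().
        throw new AssertionError("test failed mid-way");
    }

    static void tearDown() {
        closed = true; // stop()/close() the service regardless of test outcome
    }

    public static void main(String[] args) {
        try {
            try {
                testBody();
            } finally {
                tearDown(); // what JUnit does for each @After method
            }
        } catch (AssertionError expected) {
            // the failure still surfaces, but cleanup has already run
        }
        if (closed == false) {
            throw new AssertionError("teardown must have run");
        }
        System.out.println("closed = " + closed);
    }
}
```

Moving stop()/close() into @After means a failing readiness assertion can no longer leak the service's listener thread into later tests.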

@@ -156,6 +159,11 @@ ServerSocketChannel setupSocket() {
protected void doStart() {
// Mark the service as active, we'll start the listener when ES is ready
Contributor:
Is this comment still true?

Member Author:
Yes this is still true. The listener here is the tcp listener, not the cluster state listener.

@rjernst (Member Author) commented Jan 20, 2026

I don't entirely follow the PR description.

If the initial join occurs before start, and the state already reflects the "ready" state, when start is called the service will be marked active, but nothing will ever start the listener.

The listener used to be set up at the end of the ReadinessService constructor. I don't understand why that's not sufficient?

There is ambiguous terminology used throughout this class, unfortunately. There are two "listeners", the cluster state listener, and the tcp "listener". The last "start the listener" in my comment was talking about the tcp listener. I'll adjust the description to make this more clear.

@rjernst rjernst enabled auto-merge (squash) January 21, 2026 15:36
@rjernst rjernst merged commit 676e466 into elastic:main Jan 21, 2026
36 checks passed
rjernst added a commit to rjernst/elasticsearch that referenced this pull request Jan 21, 2026
@elasticsearchmachine (Collaborator)
💔 Backport failed

Branch  Result
9.3
9.1     Commit could not be cherrypicked due to conflicts
8.19    Commit could not be cherrypicked due to conflicts
9.2     Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 140791

szybia added a commit to szybia/elasticsearch that referenced this pull request Jan 21, 2026
…-tests

* upstream/main: (104 commits)
  Partition time-series source (elastic#140475)
  Mute org.elasticsearch.xpack.esql.heap_attack.HeapAttackSubqueryIT testManyRandomKeywordFieldsInSubqueryIntermediateResultsWithSortManyFields elastic#141083
  Reindex relocation: skip nodes marked for shutdown (elastic#141044)
  Make fails on fixture caching not fail image building (elastic#140959)
  Add multi-project tests for get and list reindex (elastic#140980)
  Painless docs overhaul (reference) (elastic#137211)
  Panama vector implementation of codePointCount (elastic#140693)
  Enable PromQL in release builds (elastic#140808)
  Update rest-api-spec for Jina embedding task (elastic#140696)
  [CI] ShardSearchPhaseAPMMetricsTests testUniformCanMatchMetricAttributesWhenPlentyOfDocumentsInIndex failed (elastic#140848)
  Combine hash computation with bloom filter writes/reads (elastic#140969)
  Refactor posting iterators to provide more information (elastic#141058)
  Wait for cluster to recover to yellow before checking index health (elastic#141057) (elastic#141065)
  Fix repo analysis read count assertions (elastic#140994)
  Fixed a bug in logsdb rolling upgrade sereverless tests involving par… (elastic#141022)
  Fix readiness edge case on startup (elastic#140791)
  PromQL: fix quantile function (elastic#141033)
  ignore `mmr` command for check (in development) (elastic#140981)
  Use Double.compare to compare doubles in tdigest.Sort (elastic#141049)
  Migrate third party module tests using legacy test clusters framework (elastic#140991)
  ...
elasticsearchmachine pushed a commit that referenced this pull request Jan 23, 2026

Labels

auto-backport Automatically create backport pull requests when merged backport pending >bug :Core/Infra/Node Lifecycle Node startup, bootstrapping, and shutdown Team:Core/Infra Meta label for core/infra team v8.19.11 v9.1.11 v9.2.5 v9.3.1 v9.4.0


Development

Successfully merging this pull request may close these issues.

[CI] ReadinessClusterIT testReadinessDuringRestartsNormalOrder failing

3 participants