
Conversation

@davidkyle
Member

After a full cluster restart, a model's deployment status is reported as started if it was started before the restart. This can cause a temporary inconsistency while the model is starting up again. Eventually either the model starts, at which point the started status is correct, or the deployment fails and the status is updated with a failure reason.

Rolling upgrades do not suffer from the same problem, as the trained model allocator notices nodes disappearing and reappearing and updates the model's status at that point.
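For context, the status in question is what a client sees from the trained model stats API. A request like the following (illustrative, using a hypothetical model id `my-model`) would report the deployment state under `deployment_stats` in the response, showing `started` even while the model is still loading after the restart:

```
GET _ml/trained_models/my-model/_stats
```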

Closes #93325

@davidkyle davidkyle added >test Issues or PRs that are addressing/adding tests :ml Machine learning v8.7.0 labels Jan 27, 2023
@elasticsearchmachine elasticsearchmachine added the Team:ML Meta label for the ML team label Jan 27, 2023
@elasticsearchmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)


@przemekwitek przemekwitek left a comment


LGTM

@droberts195

I agree that this will make the test work repeatably, but I do wonder if end users could run into the underlying problem once we go GA and users start pushing the functionality harder. Please can you open an issue to investigate what happens if a client (say Filebeat) is repeatedly trying to ingest data through the full cluster restart, and that data is going into an ingest pipeline that contains an ingest processor and hence doesn't work for a few seconds immediately after the full cluster restart. In this scenario Filebeat will have been getting failures to connect to Elasticsearch and buffering up logs to be ingested. But then when the cluster comes back I imagine there'll be a period when, instead of a failure to connect, it gets an error from the ingest processor. It would be good to check that this doesn't cause data loss.

The other thing is that following the restart some of the indices may not be available for a period. Some clients will check for that by waiting for yellow status. It makes me think we should try to integrate some form of "yellow status" for models into the cluster health API, so that eventually clients will be able to wait for that at the same time as waiting for yellow status on the indices.
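For comparison, the index-level wait mentioned above is a single call to the cluster health API, which blocks until the cluster reaches yellow status or the timeout expires (timeout value illustrative):

```
GET _cluster/health?wait_for_status=yellow&timeout=30s
```

A model-level equivalent integrated into this API would let clients issue one readiness check covering both indices and model deployments.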

I am not suggesting any of this should be dealt with in this PR - those things are more like research projects that potentially involve adding new hook points into core Elasticsearch. But please can you open a new issue to track the problem, since the original test failure issue is going to get closed when this PR is merged.

@davidkyle davidkyle merged commit bb50a65 into elastic:main Jan 31, 2023
@DaveCTurner
Contributor

I imagine there'll be a period when instead of failure to connect it gets an error from the ingest processor. ... those things are more like research projects that potentially involve adding new hook points into core Elasticsearch

It may help to note that in a full cluster restart all the ClusterState.Custom objects are lost, while the Metadata.Custom ones are not. One possible solution might therefore be to install a simple ClusterState.Custom once the system is under control after starting up.

@davidkyle davidkyle deleted the wait-infer branch January 31, 2023 11:51
@davidkyle
Member Author

One possible solution might therefore be to install a simple ClusterState.Custom once the system is under control after starting up.

I had considered this problem intractable, as the node has no way of knowing that it has just undergone a full cluster restart and that the model deployment status in the cluster state is therefore no longer representative.

I've opened #93377 to describe the problem in more detail, I'm hopeful using a ClusterState.Custom could solve this issue in a BWC safe way.

To be clear, the problem here is the misreported status, which clients can't reliably use to determine readiness. Data loss and availability during a full cluster restart are architectural issues. Ingest pipelines should always send failures to a failed- index so that data is not lost.
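That last pattern can be sketched as an `on_failure` handler that reroutes failed documents rather than dropping them. The pipeline name, model id, and target index below are hypothetical; the inference processor, `set` processor, and `_ingest.on_failure_message` metadata are standard ingest pipeline features:

```
PUT _ingest/pipeline/my-inference-pipeline
{
  "processors": [
    {
      "inference": {
        "model_id": "my-model",
        "on_failure": [
          { "set": { "field": "_index", "value": "failed-my-logs" } },
          { "set": { "field": "failure_reason", "value": "{{ _ingest.on_failure_message }}" } }
        ]
      }
    }
  ]
}
```

With this in place, documents that hit an inference error during the restart window land in the failure index with the error message attached, instead of being rejected back to the client.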



Development

Successfully merging this pull request may close these issues.

[CI] MLModelDeploymentFullClusterRestartIT:: testDeploymentSurvivesRestart failure
