
Conversation

@davidkyle
Member

After a full cluster restart, a model's deployment status is reported as started if it was started before the restart. This can cause a temporary inconsistency while the model is starting up again. Eventually either the model starts, at which point the started status is correct, or the deployment fails and the status is updated with a failure reason.

Rolling upgrades do not suffer from the same problem, as the trained model allocator notices nodes disappearing and reappearing and updates the model's status at that point.
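For context, the status in question is what a client sees from the trained model stats API. A request like the following (illustrative, using a hypothetical model id `my-model`) would report the deployment state under `deployment_stats` in the response, showing `started` even while the model is still loading after the restart:

```
GET _ml/trained_models/my-model/_stats
```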

Closes #93325

@davidkyle davidkyle added >test Issues or PRs that are addressing/adding tests :ml Machine learning v8.7.0 labels Jan 27, 2023
@elasticsearchmachine elasticsearchmachine added the Team:ML Meta label for the ML team label Jan 27, 2023
@elasticsearchmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)


@przemekwitek przemekwitek left a comment


LGTM

@droberts195

I agree that this will make the test work repeatably, but I do wonder if end users could run into the underlying problem once we go GA and users start pushing the functionality harder. Please can you open an issue to investigate what happens if a client (say Filebeat) is repeatedly trying to ingest data through the full cluster restart, and that data is going into an ingest pipeline that contains an ingest processor and hence doesn't work for a few seconds immediately after the full cluster restart. In this scenario Filebeat will have been getting failures to connect to Elasticsearch and buffering up logs to be ingested. But then when the cluster comes back I imagine there'll be a period when, instead of a failure to connect, it gets an error from the ingest processor. It would be good to check that this doesn't cause data loss.

The other thing is that following the restart some of the indices may not be available for a period. Some clients will check for that by waiting for yellow status. It makes me think we should try to integrate some form of "yellow status" for models into the cluster health API, so that eventually clients will be able to wait for that at the same time as waiting for yellow status on the indices.
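For comparison, the index-level wait mentioned above is a single call to the cluster health API, which blocks until the cluster reaches yellow status or the timeout expires (timeout value illustrative):

```
GET _cluster/health?wait_for_status=yellow&timeout=30s
```

A model-level equivalent integrated into this API would let clients issue one readiness check covering both indices and model deployments.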

I am not suggesting any of this should be dealt with in this PR - those things are more like research projects that potentially involve adding new hook points into core Elasticsearch. But please can you open a new issue to track the problem, since the original test failure issue is going to get closed when this PR is merged.

@davidkyle davidkyle merged commit bb50a65 into elastic:main Jan 31, 2023
@DaveCTurner
Contributor

I imagine there'll be a period when instead of failure to connect it gets an error from the ingest processor. ... those things are more like research projects that potentially involve adding new hook points into core Elasticsearch

It may help to note that in a full cluster restart all the ClusterState.Custom objects are lost, while the Metadata.Custom ones are not. One possible solution might therefore be to install a simple ClusterState.Custom once the system is under control after starting up.

@davidkyle davidkyle deleted the wait-infer branch January 31, 2023 11:51
@davidkyle
Member Author

One possible solution might therefore be to install a simple ClusterState.Custom once the system is under control after starting up.

I had considered this problem intractable, as the node has no way of knowing that it has just undergone a full cluster restart and that the model deployment status in the cluster state is therefore no longer representative.

I've opened #93377 to describe the problem in more detail, I'm hopeful using a ClusterState.Custom could solve this issue in a BWC safe way.

To be clear, the problem here is the misreported status, which clients can't reliably use to determine readiness. Data loss and availability during a full cluster restart are architectural issues. Ingest pipelines should always send failures to a failed- index so that data is not lost.
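That last pattern can be sketched as an `on_failure` handler that reroutes failed documents rather than dropping them. The pipeline name, model id, and target index below are hypothetical; the inference processor, `set` processor, and `_ingest.on_failure_message` metadata are standard ingest pipeline features:

```
PUT _ingest/pipeline/my-inference-pipeline
{
  "processors": [
    {
      "inference": {
        "model_id": "my-model",
        "on_failure": [
          { "set": { "field": "_index", "value": "failed-my-logs" } },
          { "set": { "field": "failure_reason", "value": "{{ _ingest.on_failure_message }}" } }
        ]
      }
    }
  ]
}
```

With this in place, documents that hit an inference error during the restart window land in the failure index with the error message attached, instead of being rejected back to the client.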



Development

Successfully merging this pull request may close these issues.

[CI] MLModelDeploymentFullClusterRestartIT:: testDeploymentSurvivesRestart failure
