[ML] Wait for _infer to work after restart in full cluster restart tests. #93327
Conversation
Pinging @elastic/ml-core (Team:ML)
przemekwitek left a comment
LGTM
I agree that this will make the test work repeatably, but I do wonder whether end users could run into the underlying problem once we go GA and users start pushing the functionality harder. Please can you open an issue to investigate what happens if a client (say Filebeat) is repeatedly trying to ingest data through the full cluster restart, and that data is going into an ingest pipeline that contains an inference processor and hence doesn't work for a few seconds immediately after the full cluster restart. In this scenario Filebeat will have been getting failures to connect to Elasticsearch and buffering up logs to be ingested. But then, when the cluster comes back, I imagine there'll be a period when, instead of a failure to connect, it gets an error from the inference processor. It would be good to check that this doesn't cause data loss.

The other thing is that following the restart some of the indices may not be available for a period. Some clients will check for that by waiting for yellow status. It makes me think we should try to integrate some form of "yellow status" for models into the cluster health API, so that eventually clients will be able to wait for that at the same time as waiting for yellow status on the indices.

I am not suggesting any of this should be dealt with in this PR - those things are more like research projects that potentially involve adding new hook points into core Elasticsearch. But please can you open a new issue to track the problem, since the original test failure issue is going to get closed when this PR is merged.
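To make that scenario concrete, here is a minimal sketch of a client-side readiness check as things stand today: wait for yellow index status, then probe the model's _infer endpoint until it answers. The model id, document field, and timeout are assumptions for illustration and are not part of this PR.

```java
import java.time.Duration;
import java.time.Instant;

import org.elasticsearch.client.Request;
import org.elasticsearch.client.ResponseException;
import org.elasticsearch.client.RestClient;

/**
 * Illustrative sketch only: a client-side readiness check after a full cluster
 * restart. The model id ("my-nlp-model"), document field ("text_field") and
 * timeout are hypothetical placeholders.
 */
public final class WaitForClusterAndModel {

    public static void waitUntilReady(RestClient client, String modelId, Duration timeout) throws Exception {
        // 1. Wait for yellow status on the indices, as many clients already do.
        Request health = new Request("GET", "/_cluster/health");
        health.addParameter("wait_for_status", "yellow");
        health.addParameter("timeout", timeout.getSeconds() + "s");
        client.performRequest(health);

        // 2. There is currently no equivalent "yellow status" for models, so probe
        //    the _infer endpoint until it stops returning errors.
        Request infer = new Request("POST", "/_ml/trained_models/" + modelId + "/_infer");
        infer.setJsonEntity("{\"docs\":[{\"text_field\":\"warm up\"}]}");

        Instant deadline = Instant.now().plus(timeout);
        while (true) {
            try {
                client.performRequest(infer);
                return; // the deployment is serving inference requests again
            } catch (ResponseException e) {
                // Right after the restart the deployment status may still say "started"
                // while the model is reloading, so _infer can fail for a short period.
                if (Instant.now().isAfter(deadline)) {
                    throw e;
                }
                Thread.sleep(1_000);
            }
        }
    }
}
```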
It may help to note that in a full cluster restart all the nodes stop and restart at the same time, so there is no point at which individual nodes can be observed leaving and rejoining the cluster.
I had considered this problem intractable, as the node had no way of knowing that it has just undergone a full cluster restart and that therefore the model deployment status in the cluster state is no longer representative. I've opened #93377 to describe the problem in more detail, and I'm hopeful it can be addressed there.

To be clear, the problem here is the mis-reported status, which clients can't reliably use to determine readiness. Data loss and availability during a full cluster restart are architectural issues; ingest pipelines should always send failures to a destination where they can be recovered, rather than dropping the documents.
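As a hedged illustration of that last point (not code from this change), an ingest pipeline can attach an on_failure handler to its inference processor so that documents which fail inference are redirected rather than lost; the pipeline name, model id, and failure index prefix below are placeholders.

```java
import org.elasticsearch.client.Request;
import org.elasticsearch.client.RestClient;

/**
 * Illustrative sketch: a pipeline whose inference processor redirects failed
 * documents to a separate index via on_failure, rather than dropping them.
 * "logs-nlp", "my-nlp-model" and the "failed-" index prefix are placeholders.
 */
public final class CreatePipelineWithOnFailure {

    public static void createPipeline(RestClient client) throws Exception {
        Request request = new Request("PUT", "/_ingest/pipeline/logs-nlp");
        request.setJsonEntity("""
            {
              "processors": [
                {
                  "inference": {
                    "model_id": "my-nlp-model",
                    "on_failure": [
                      { "set": { "field": "_index", "value": "failed-{{{ _index }}}" } },
                      { "set": { "field": "error.message", "value": "{{{ _ingest.on_failure_message }}}" } }
                    ]
                  }
                }
              ]
            }""");
        client.performRequest(request);
    }
}
```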
After a full cluster restart, a model's deployment status is reported as started if it was started before the restart. This can cause a temporary inconsistency while the model is starting up again: eventually the model either starts, in which case the started status becomes correct, or the model deployment fails and the status is updated with a failure reason.
Rolling upgrades do not suffer from the same problem, because the trained model allocator notices nodes disappearing/reappearing and updates the model's status at that point.
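A minimal sketch of what waiting for _infer after the restart can look like in a REST test follows; the class name, model id, document field, and timeout are illustrative assumptions rather than the exact code in this change.

```java
import java.util.concurrent.TimeUnit;

import org.elasticsearch.client.Request;
import org.elasticsearch.client.ResponseException;
import org.elasticsearch.test.rest.ESRestTestCase;

/**
 * Sketch only: after the full cluster restart, keep retrying _infer until the
 * deployment is actually serving requests, instead of trusting the reported
 * "started" status. Names are illustrative placeholders.
 */
public class MlModelInferAfterRestartIT extends ESRestTestCase {

    void waitForInferToWork(String modelId) throws Exception {
        assertBusy(() -> {
            Request infer = new Request("POST", "/_ml/trained_models/" + modelId + "/_infer");
            infer.setJsonEntity("{\"docs\":[{\"text_field\":\"restart check\"}]}");
            try {
                client().performRequest(infer);
            } catch (ResponseException e) {
                // The status may still say "started" while the model reloads,
                // so treat errors as "not ready yet" and let assertBusy retry.
                throw new AssertionError("model not ready for inference yet", e);
            }
        }, 60, TimeUnit.SECONDS);
    }
}
```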
Closes #93325