[ML] Wait for model process to stop in stop deployment #83644
| Pinging @elastic/ml-core (Team:ML) | 
| Hi @davidkyle, I've created a changelog YAML for you. | 
|  | ||
| public void stop(String reason) { | ||
| logger.debug("[{}] Stopping due to reason [{}]", getModelId(), reason); | ||
| licensedFeature.stopTracking(licenseState, "model-" + params.getModelId()); | 
I think this still needs to be called. If you are concerned, maybe wrap the listener and call this on response/failure?
Yeah, I'm not happy with this pattern. The task asks the node service to stop, then the node service calls back to TrainedModelDeploymentTask::markAsStopped from stopDeploymentAsync. That forces markAsStopped to be a public method, which doesn't make much sense as part of the public API.
The problem is that there are a few ways TrainedModelAllocationNodeService::stopDeploymentAsync can be called, such as the service noticing the deployment has been deleted or the task being cancelled.
I deleted these lines because markAsStopped (formerly stopWithoutNotification) was being called anyway and the same work occurs there.
> I deleted these lines because markAsStopped (formerly stopWithoutNotification) was being called anyway and the same work occurs there.
Gotcha, we just need to be careful and make sure that stopTracking is being called.
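One way to guarantee the cleanup runs, along the lines of the "wrap the listener" suggestion above, is to decorate the listener so the same action fires on both response and failure. This is only a minimal sketch: the `Listener` interface and the `runAfter` helper are hypothetical and defined here for illustration (they stand in for the real Elasticsearch classes rather than reproduce them), and the counter stands in for `stopTracking`.

```java
import java.util.concurrent.atomic.AtomicInteger;

class TrackingListenerSketch {

    // Hypothetical callback interface, defined here only for illustration.
    interface Listener<T> {
        void onResponse(T response);
        void onFailure(Exception e);
    }

    // Hypothetical helper: runs `cleanup` after the delegate is notified,
    // whether the outcome is a response or a failure.
    static <T> Listener<T> runAfter(Listener<T> delegate, Runnable cleanup) {
        return new Listener<T>() {
            @Override
            public void onResponse(T response) {
                try {
                    delegate.onResponse(response);
                } finally {
                    cleanup.run();
                }
            }

            @Override
            public void onFailure(Exception e) {
                try {
                    delegate.onFailure(e);
                } finally {
                    cleanup.run();
                }
            }
        };
    }

    // Returns how many times the stand-in for stopTracking ran.
    static int cleanupCalls(boolean fail) {
        AtomicInteger stopTrackingCalls = new AtomicInteger();
        Listener<Boolean> original = new Listener<Boolean>() {
            @Override
            public void onResponse(Boolean response) {}

            @Override
            public void onFailure(Exception e) {}
        };
        Listener<Boolean> wrapped = runAfter(original, stopTrackingCalls::incrementAndGet);
        if (fail) {
            wrapped.onFailure(new IllegalStateException("process did not stop"));
        } else {
            wrapped.onResponse(true);
        }
        return stopTrackingCalls.get();
    }

    public static void main(String[] args) {
        System.out.println("cleanup calls on success: " + cleanupCalls(false));
        System.out.println("cleanup calls on failure: " + cleanupCalls(true));
    }
}
```

With this shape the cleanup cannot be skipped by an early return on the failure path, which is the concern raised in this thread.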
| listener.onResponse(new StopTrainedModelDeploymentAction.Response(true)); | ||
| task.stop( | ||
| "undeploy_trained_model (api)", | ||
| ActionListener.wrap(r -> listener.onResponse(new StopTrainedModelDeploymentAction.Response(true)), listener::onFailure) | 
This is much cleaner and the execution path is easier to read.
| .prepareListTasks(nodesOfConcern.toArray(String[]::new)) | ||
| .setDetailed(true) | ||
| .setWaitForCompletion(true) | ||
| .setActions(modelId) | 
I think this bug was introduced in #81259.
The actions used to contain the model id. Regardless, the new stopping path is much cleaner.
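To illustrate why the old wait returned immediately: if the action filter matches no running task, a "wait for completion" has nothing to wait on. The sketch below is a simplification under stated assumptions. The `RunningTask` class, the action name `hypothetical:deployment/action`, and the exact-match filter are all made up for illustration and are not the real tasks API or its matching rules.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class ActionFilterSketch {

    // Hypothetical stand-in for a running task: an action name plus a description.
    static final class RunningTask {
        final String action;
        final String description;

        RunningTask(String action, String description) {
            this.action = action;
            this.description = description;
        }
    }

    // Simplified filter: keep only tasks whose action equals the filter string.
    static List<RunningTask> matching(List<RunningTask> tasks, String actionFilter) {
        return tasks.stream()
            .filter(t -> t.action.equals(actionFilter))
            .collect(Collectors.toList());
    }

    // A wait only blocks if at least one task matched the filter.
    static boolean wouldWait(List<RunningTask> tasks, String actionFilter) {
        return !matching(tasks, actionFilter).isEmpty();
    }

    static List<RunningTask> sampleTasks() {
        // Assumed: the action no longer embeds the model id.
        return Arrays.asList(new RunningTask("hypothetical:deployment/action", "model-my_model"));
    }

    public static void main(String[] args) {
        List<RunningTask> tasks = sampleTasks();
        System.out.println("filter by model id waits: " + wouldWait(tasks, "my_model"));
        System.out.println("filter by action waits:   " + wouldWait(tasks, "hypothetical:deployment/action"));
    }
}
```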
| 💚 Backport successful
 | 
When stopping a deployment, the tasks API was used to wait for the model task to finish, but the action used in the request did not match the model task's action, so the request returned success without waiting. The code then continued and deleted the model allocation. The node running the task would notice the deletion and, if the task had not stopped yet, stop it again. The result was a double stop and an error in the log, since a second stop is unexpected. The bug is minor and only manifests in the log files; the stop deployment call still succeeds.
The fix is to add a listener to the task stop API that responds only after the task is stopped.
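The ordering this fix aims for can be sketched as follows. This is an illustration only, not the code in this PR: `Listener`, `Task`, and `stopDeployment` are hypothetical names defined here, and the asynchronous process shutdown is simulated with a `CompletableFuture`.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

class StopListenerSketch {

    // Hypothetical callback interface, defined here only for illustration.
    interface Listener<T> {
        void onResponse(T response);
        void onFailure(Exception e);

        static <T> Listener<T> wrap(Consumer<T> onResponse, Consumer<Exception> onFailure) {
            return new Listener<T>() {
                @Override
                public void onResponse(T response) {
                    onResponse.accept(response);
                }

                @Override
                public void onFailure(Exception e) {
                    onFailure.accept(e);
                }
            };
        }
    }

    // Hypothetical task whose stop() notifies the listener only after the
    // (simulated) model process has stopped.
    static class Task {
        final List<String> events = new CopyOnWriteArrayList<>();

        void stop(String reason, Listener<Void> listener) {
            CompletableFuture.runAsync(() -> {
                try {
                    events.add("process stopped: " + reason);
                } catch (RuntimeException e) {
                    listener.onFailure(e);
                    return;
                }
                listener.onResponse(null);
            });
        }
    }

    // The caller responds, and only then deletes the allocation,
    // after the stop listener has fired.
    static List<String> stopDeployment() {
        Task task = new Task();
        CompletableFuture<Boolean> apiResponse = new CompletableFuture<>();
        task.stop(
            "undeploy_trained_model (api)",
            Listener.wrap(r -> {
                task.events.add("api responded");
                apiResponse.complete(true);
            }, apiResponse::completeExceptionally)
        );
        apiResponse.join(); // blocks until the listener has been notified
        task.events.add("allocation deleted");
        return task.events;
    }

    public static void main(String[] args) {
        stopDeployment().forEach(System.out::println);
    }
}
```

Because the allocation is deleted only after the listener fires, the node never sees the deletion while the task is still running, so there is no second stop.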