[ML] Improve cleanup for model snapshot upgrades #81831

droberts195 merged 7 commits into elastic:master
Conversation
If a model snapshot upgrade persistent task is cancelled then we now kill any associated C++ process. Previously the C++ process could hang indefinitely. Additionally, ML feature reset now cancels any in-progress model snapshot upgrades before cleaning up job data, and deleting an anomaly detection job cancels any in-progress model snapshot upgrades associated with that job before cleaning up the job's data. Fixes elastic#81578
Pinging @elastic/ml-core (Team:ML)
benwtrent left a comment:

Other than my comments, this looks good to go.
```java
PARSER.declareString((request, jobId) -> request.jobId = jobId, Job.ID);
PARSER.declareString((request, snapshotId) -> request.snapshotId = snapshotId, SNAPSHOT_ID);
```

Suggested change:

```suggestion
PARSER.declareString(Request::setJobId, Job.ID);
PARSER.declareString(Request::setSnapshotId, SNAPSHOT_ID);
```

This seems OK as we are not allowing null in the parser declarations.
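For context, the two declaration styles are equivalent; the suggestion just swaps a lambda for a method reference on a setter. A minimal self-contained sketch (`MiniParser` and its nested `Request` are hypothetical stand-ins, not the real Elasticsearch `ObjectParser` classes):

```java
import java.util.function.BiConsumer;

// Hypothetical stand-in for the ObjectParser declaration pattern discussed above.
public class MiniParser {

    public static class Request {
        String jobId;
        String snapshotId;

        public void setJobId(String jobId) { this.jobId = jobId; }
        public void setSnapshotId(String snapshotId) { this.snapshotId = snapshotId; }
    }

    // Applies a declared "field setter" to a request, as a parser would.
    static void apply(BiConsumer<Request, String> setter, Request request, String value) {
        setter.accept(request, value);
    }

    public static Request parse(String jobId, String snapshotId) {
        Request request = new Request();
        // Lambda form, as in the original code:
        apply((r, id) -> r.jobId = id, request, jobId);
        // Method-reference form, as in the suggested change:
        apply(Request::setSnapshotId, request, snapshotId);
        return request;
    }
}
```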
```java
public class CancelJobModelSnapshotUpgradeAction extends ActionType<CancelJobModelSnapshotUpgradeAction.Response> {

    public static final CancelJobModelSnapshotUpgradeAction INSTANCE = new CancelJobModelSnapshotUpgradeAction();
    public static final String NAME = "cluster:admin/xpack/ml/job/model_snapshots/upgrade/cancel";
```

Suggested change:

```suggestion
    public static final String NAME = "cluster:internal/xpack/ml/job/model_snapshots/upgrade/cancel";
```

Since there is no REST layer for this, it seems prudent to be internal.
I intentionally made this admin instead of internal. Although there's no REST layer today it's possible that we might need to add one in the future, say if we ever want the UI to be able to clean up a failed upgrade.
The reason we've used internal instead of admin in the past is to make it hard for users of the transport client to call the API, even though it's easy for them to write the code to call it. Since there's nothing inherently wrong with calling this endpoint if you want to clean up failed upgrades I don't think that's a problem in this case. (Plus it's only in 7.17 that the transport client will even exist - after that lack of a REST layer will make this impossible to call.) I will add a comment to say this.
Part of the reason I didn't add a REST layer in this PR was because it's being added so late in the 7.x development cycle and it's not strictly essential at the moment.
internal conveys a message even if just a convention. If this was to become a public API with a REST interface we would create a new admin wrapper action that calls this one and have this action internal. We have many internal APIs called this way.
I don't consider this a blocker on the PR but there is a path for making internal APIs public, if you think it likely there will be a REST API then keep it admin otherwise it makes more sense to be internal.
I think it's 50/50 whether this will be made public in the future.
> If this was to become a public API with a REST interface we would create a new admin wrapper action that calls this one and have this action internal.
This seems like needless baggage to me. The wrapper action would literally call the internal action and do nothing else. It would just be extra boilerplate code.
> internal conveys a message even if just a convention.
It's only the ML team that has ever done this. All the internal actions of core Elasticsearch start with either admin or monitor. (For example look at the persistent tasks actions.) The original reason why the ML team did this was to make transport client users think twice before calling an API that was intended to be internal. Of course this only had an effect if security was enabled, but it was in the days of X-Pack as a separate plugin where ML users were likely to be using security but most users weren't using security. It was also in the days before the transport client was deprecated so we were more worried about transport client users accidentally automating things they shouldn't.
But as I said in the comment I added to the code, in this case we wouldn't really care if a transport client user did call this action. It doesn't do something low level that will mess up the cluster if called at the wrong time. It's an action that could be external, it just isn't at the moment.
I agree that as the transport client is not used directly any more we don't really gain much by calling the action internal.
It's good we had this discussion then, as internal is a convention we have been using for a long time. If we choose to stop doing this then we have already established the usage of naming internal actions internal, e.g. InternalInferModelAction; perhaps we can switch to that convention for actions that are purely internal.
The key difference with this action is that it doesn't need to be internal. It should be possible to make it external in the future with as little extra effort as possible. Whereas InternalInferModelAction is specifically designed to be internal, and it would cause problems if people ever started calling it externally. An action that's more similar to this one is DeleteExpiredDataAction, where its primary purpose is to be called by the daily maintenance task, but it doesn't hurt to call it externally. DeleteExpiredDataAction does have a corresponding REST endpoint, but if we'd wanted to avoid that initially to avoid yet another endpoint that shows up in our list of documented endpoints then we could have deferred adding the REST endpoint to a release after the action was added.
> It's only the ML team that has ever done this. All the internal actions of core Elasticsearch start with either admin or monitor.

> I agree that as the transport client is not used directly any more we don't really gain much by calling the action internal.
I was replying to these points stating that the original reason for using internal no longer applies, and suggesting it can be discontinued with the demise of the transport client.
> I was replying to these points stating that the original reason for using internal no longer applies, and suggesting it can be discontinued with the demise of the transport client.
Yes, sure, for new actions we can discontinue it. But for the existing ones that contain internal changing it is a non-trivial piece of work, so we might as well leave them as-is.
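For what it's worth, the convention under discussion reduces to a prefix on the action name, which tooling and security policy can key off. A toy sketch of that idea (`ActionNames` and `isInternal` are illustrative helpers, not Elasticsearch APIs):

```java
// Illustrative only: the admin/internal distinction is carried in the
// action-name prefix. isInternal is a hypothetical helper, not an ES API.
public class ActionNames {

    public static final String CANCEL_UPGRADE = "cluster:admin/xpack/ml/job/model_snapshots/upgrade/cancel";

    public static boolean isInternal(String actionName) {
        return actionName.startsWith("cluster:internal/");
    }
}
```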
```java
"cluster:admin/xpack/ml/job/model_snapshots/revert",
"cluster:admin/xpack/ml/job/model_snapshots/update",
"cluster:admin/xpack/ml/job/model_snapshots/upgrade",
"cluster:admin/xpack/ml/job/model_snapshots/upgrade/cancel",
```

Suggested change:

```suggestion
"cluster:internal/xpack/ml/job/model_snapshots/upgrade/cancel",
```
```java
    super(NAME, Response::new);
}

public static class Request extends BaseTasksRequest<Request> implements ToXContentObject {
```

The request is a BaseTasksRequest yet the transport action is a HandledTransportAction, not a TransportTasksAction.

Suggested change:

```suggestion
public static class Request extends ActionRequest<Request> implements ToXContentObject {
```
```java
// cannot be as thorough.
List<PersistentTasksCustomMetadata.PersistentTask<?>> upgradeTasksToCancel = MlTasks.snapshotUpgradeTasks(tasksInProgress)
    .stream()
    .filter(t -> jobIds.contains(((SnapshotUpgradeTaskParams) t.getParams()).getJobId()))
```
Why not use a SimpleIdsMatcher for job IDs and save a call to expandJobs?
expandJobs handles allow_no_match. It's impossible in this action for allow_no_match to work with the snapshot IDs, because the upgrade tasks are ephemeral and stopping a stopped thing is not an error. But allow_no_match can work with the job IDs.
This also ties in with #81831 (comment) because if we were only ever going to use this action internally then it wouldn't be so important to support allow_no_match at all. But I can imagine in the future we might want to call this action from an enhanced UI. We'll need to add a REST endpoint at that time and it's nice if the machinery of allow_no_match is available. The way allow_no_match is handled for jobs in this action is the same as in the recently added model snapshot upgrades stats action.
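To illustrate the allow_no_match semantics described here, a simplified sketch (`JobExpander.expandJobIds` is a hypothetical reduction of what expandJobs does; the real method also handles job groups and comma-separated wildcard expressions):

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Simplified sketch of allow_no_match handling; not the real expandJobs.
public class JobExpander {

    public static List<String> expandJobIds(String expression, Set<String> existingJobs, boolean allowNoMatch) {
        List<String> matched = existingJobs.stream()
            .filter(id -> "*".equals(expression) || id.equals(expression))
            .sorted()
            .collect(Collectors.toList());
        // With allow_no_match=false, an expression that matches nothing is an error.
        if (matched.isEmpty() && allowNoMatch == false) {
            throw new IllegalArgumentException("No known job with id [" + expression + "]");
        }
        return matched;
    }
}
```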
There is a difference to the GET stats API which enforces that the snapshot IDs are also matched.
I think what I've done is as sensible as it can be given the way the persistent tasks appear and disappear.
Consider these scenarios:
Scenario 1:
- User asks for stats for job foo, upgrade of snapshot 1
- Job foo doesn't exist
- We return not found
- This isn't controversial
Scenario 2:
- User asks for stats for job foo, upgrade of snapshot 1
- Job foo exists but snapshot 1 isn't being upgraded
- We return not found
- This one is potentially controversial, as the snapshot upgrade might only just have finished, but since it's not running we cannot return any stats. The only things we could do to make this better would be (1) persist stats for snapshot upgrades that have finished or (2) make upgrading a snapshot into a multi-step process where you first open the upgrade, then later close it (like you do for jobs). Both of these things bring considerable extra baggage to the upgrade process.
Scenario 3:
- User asks for stats for job foo, upgrade of snapshot 1
- Job foo exists and snapshot 1 is being upgraded
- We return the stats
- This isn't controversial
Scenario 4:
- User asks to cancel upgrade of snapshot 1 for job foo
- Job foo doesn't exist
- We return not found
- This isn't controversial
Scenario 5:
- User asks to cancel upgrade of snapshot 1 for job foo
- Job foo exists but snapshot 1 isn't being upgraded
- We do nothing and return success
- This one is potentially controversial, but the end state is what the user wanted, and the snapshot upgrade might have just finished in between the user listing the upgrades in progress and asking to cancel this one. It is also consistent with the fact that closing a closed job is not an error and stopping a stopped datafeed is not an error.
Scenario 6:
- User asks to cancel upgrade of snapshot 1 for job foo
- Job foo exists and snapshot 1 is being upgraded
- We cancel the upgrade and return success
- This isn't controversial
I don't think there's a better way.
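The cancel-side scenarios above amount to: an unknown job is an error, everything else is an idempotent success. A toy sketch of that contract (`CancelSemantics` is hypothetical; the real action works on persistent tasks, not in-memory sets):

```java
import java.util.Set;

// Hypothetical sketch of scenarios 4-6: cancelling is idempotent, but an
// unknown job is still reported as not found.
public class CancelSemantics {

    public static boolean cancel(String jobId, Set<String> existingJobs, Set<String> upgradingSnapshots, String snapshotId) {
        if (existingJobs.contains(jobId) == false) {
            // Scenario 4: unknown job -> not found.
            throw new IllegalArgumentException("No known job with id [" + jobId + "]");
        }
        // Scenario 5: snapshot not being upgraded -> no-op, still success.
        // Scenario 6: upgrade in progress -> cancel it, success.
        upgradingSnapshots.remove(snapshotId);
        return true;
    }
}
```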
[7.17] Backport of #81831: [ML] Improve cleanup for model snapshot upgrades.