Skip to content

Conversation

@benwtrent
Copy link
Member

Failure count should not only be reset at checkpoints. Checkpoints could have many pages of data. Consequently, we should reset the failure count once we handle a single composite aggregation page.

This way, the transform won't mark itself as failed erroneously when it has actually succeeded searches + indexing results within the same checkpoint.

closes #76074

@elasticmachine elasticmachine added the Team:ML Meta label for the ML team label Aug 11, 2021
@elasticmachine
Copy link
Collaborator

Pinging @elastic/ml-core (Team:ML)

Comment on lines +150 to +158
if (bulkResponse.hasFailures() == false) {
// We don't know the of failures that have occurred (searching, processing, indexing, etc.),
// but if we search, process and bulk index then we have
// successfully processed an entire page of the transform and should reset the counter, even if we are in the middle
// of a checkpoint
context.resetReasonAndFailureCounter();
nextPhase.onResponse(bulkResponse);
return;
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, this diff is a pain. The only difference here is the resetReasonAndFailureCounter and the comments. I created a new method here so that this is testable.

this.auditor = transformServices.getAuditor();
this.transformConfig = ExceptionsHelper.requireNonNull(transformConfig, "transformConfig");
this.progress = progress != null ? progress : new TransformProgress();
this.progress = transformProgress != null ? transformProgress : new TransformProgress();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, this seemed like a bug, progress was assigning to itself, which was never set? We should set it to the passed value.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was broken in #75459, which is a 7.15 change, and explains why it hasn't caused a bug report from a user yet.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😌

@benwtrent benwtrent added v7.14.1 auto-backport Automatically create backport pull requests when merged labels Aug 11, 2021
Copy link

@droberts195 droberts195 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@benwtrent benwtrent merged commit 70d91d1 into elastic:master Aug 11, 2021
@benwtrent benwtrent deleted the transforms/reset-failure-count-on-indexing-success branch August 11, 2021 14:59
@elasticsearchmachine
Copy link
Collaborator

💔 Backport failed

Status Branch Result
7.14 Commit could not be cherrypicked due to conflicts
7.x Commit could not be cherrypicked due to conflicts

To backport manually run:
backport --pr 76355

benwtrent added a commit to benwtrent/elasticsearch that referenced this pull request Aug 11, 2021
… is handled successfully (elastic#76355)

Failure count should not only be reset at checkpoints. Checkpoints could have many pages of data. Consequently, we should reset the failure count once we handle a single composite aggregation page.

This way, the transform won't mark itself as failed erroneously when it has actually succeeded searches + indexing results within the same checkpoint.

closes elastic#76074
benwtrent added a commit to benwtrent/elasticsearch that referenced this pull request Aug 11, 2021
… is handled successfully (elastic#76355)

Failure count should not only be reset at checkpoints. Checkpoints could have many pages of data. Consequently, we should reset the failure count once we handle a single composite aggregation page.

This way, the transform won't mark itself as failed erroneously when it has actually succeeded searches + indexing results within the same checkpoint.

closes elastic#76074
elasticsearchmachine pushed a commit that referenced this pull request Aug 11, 2021
…n page is handled successfully (#76355) (#76365)

* [ML][Transform] reset failure count when a transform aggregation page is handled successfully (#76355)

Failure count should not only be reset at checkpoints. Checkpoints could have many pages of data. Consequently, we should reset the failure count once we handle a single composite aggregation page.

This way, the transform won't mark itself as failed erroneously when it has actually succeeded searches + indexing results within the same checkpoint.

closes #76074

* fixing compilation
elasticsearchmachine pushed a commit that referenced this pull request Aug 11, 2021
…on page is handled successfully (#76355) (#76366)

* [ML][Transform] reset failure count when a transform aggregation page is handled successfully (#76355)

Failure count should not only be reset at checkpoints. Checkpoints could have many pages of data. Consequently, we should reset the failure count once we handle a single composite aggregation page.

This way, the transform won't mark itself as failed erroneously when it has actually succeeded searches + indexing results within the same checkpoint.

closes #76074

* fixing tests

* fixing tests
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

auto-backport Automatically create backport pull requests when merged >bug :ml/Transform Transform Team:ML Meta label for the ML team v7.14.1 v7.15.0 v8.0.0-alpha2

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Transform] Reset failure counter after 1st success after a failure

5 participants