Skip to content

Conversation

@mikecote
Copy link
Contributor

@mikecote mikecote commented Aug 21, 2020

⚠️ This PR merges into a feature branch

Part of #65553.

Feature branch PR: #75666.

In this PR, I'm making the task manager store emit error events via its new property errors$ that can later be used to filter 429 errors in order to apply back pressure in the task manager. The fundamental change is to wrap each call to the savedObjectsRepository or to callCluster in a try catch and then push the error into the errors$ observable as well as the usual, throwing.

@mikecote mikecote added release_note:skip Skip the PR/issue when compiling release notes Feature:Task Manager Team:ResponseOps Platform ResponseOps team (formerly the Cases and Alerting teams) t// labels Aug 21, 2020
@mikecote mikecote self-assigned this Aug 21, 2020
@mikecote mikecote force-pushed the task-manager/429-store-events branch from 61fd9c0 to de40033 Compare August 27, 2020 17:05
@mikecote mikecote marked this pull request as ready for review August 31, 2020 16:40
@mikecote mikecote requested a review from a team as a code owner August 31, 2020 16:40
@elasticmachine
Copy link
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

Copy link
Contributor

@YulNaumenko YulNaumenko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Member

@pmuellr pmuellr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, made some "design" kinda notes for alternative approaches, this seems fine so far tho. I'd forge ahead with writing the consumer of this new stream ... :-)

});
let result;
try {
result = await this.callCluster('search', {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so the error stream will be a mixture of saved object errors and es errors; I wonder if we can pick out 429s out of there reasonably well :-). Alternatively, we could have some kind of object we could convert those two errors before next()ing them to make it easier on the consumer. I think wait to the next step of writing a consumer of this stream to figure out the best approach.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we can pick out 429s out of there reasonably well

I enhanced the saved objects errors helper to hopefully facilitate this in #75664. To detect SOC 429 errors, it would be done via SavedObjectsErrorHelpers.isTooManyRequestsError(e). I'm not too sure yet how it will look on callCluster calls, I've noticed some include retry logic instead of throwing 429 errors so I'll have to see :-)

I think the same as you said, we'll see how it looks like at the next step and we can improve if needed.

taskInstanceToAttributes(taskInstance),
{ id: taskInstance.id, refresh: false }
);
let savedObject;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC, SO client allows you write "middleware"; in theory, if it had everything we needed, we could use that to feed the error stream, without these explicit catches. We also have the callCluster usage below, it could be easy wrapped I think. Not sure it's worth it though.

Copy link
Contributor Author

@mikecote mikecote Sep 3, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pmuellr I'll investigate this a bit. I'm not sure how the wrappers work entirely but I'll check. I think it's either done globally for all instances of SOC (which may be what we're thinking?) or can be done at an instance level (which I don't think is supported but we could wrap it ourselves) 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took a look at how this could be done. The saved objects client exposes a addClientWrapper but it wraps every instance of SOC created in the Kibana instance (including non task manager related ones). This would proxy every call made to the SOC which sounds worrisome to me. With that finding, I think I'll keep this iteration to what it is.

Copy link
Contributor

@gmmorris gmmorris left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense to me 👍

@kibanamachine
Copy link
Contributor

💚 Build Succeeded

Build metrics

‼️ unable to find a baseline build for [feature/task_manager_429@32468ad]. Try merging the upstream branch and trying again.

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@mikecote mikecote merged commit d721fea into elastic:feature/task_manager_429 Sep 9, 2020
mikecote added a commit that referenced this pull request Oct 13, 2020
…ith a 429 (#75666)

* Make task manager maxWorkers and pollInterval observables (#75293)

* WIP step 1

* WIP step 2

* Cleanup

* Make maxWorkers an observable for the task pool

* Cleanup

* Fix test failures

* Use BehaviorSubject

* Add some tests

* Make the task manager store emit error events (#75679)

* Add errors$ observable to the task store

* Add unit tests

* Temporarily apply back pressure to maxWorkers and pollInterval when 429 errors occur (#77096)

* WIP

* Cleanup

* Add error count to message

* Reset observable values on stop

* Add comments

* Fix issues when changing configurations

* Cleanup code

* Cleanup pt2

* Some renames

* Fix typecheck

* Use observables to manage throughput

* Rename class

* Switch to createManagedConfiguration

* Add some comments

* Start unit tests

* Add logs

* Fix log level

* Attempt at adding integration tests

* Fix test failures

* Fix timer

* Revert "Fix timer"

This reverts commit 0817e5e.

* Use Symbol

* Fix merge scan

* replace startsWith with a timer that is scheduled to 0

* typo

Co-authored-by: Kibana Machine <[email protected]>
Co-authored-by: Gidi Meir Morris <[email protected]>
mikecote added a commit to mikecote/kibana that referenced this pull request Oct 13, 2020
…ith a 429 (elastic#75666)

* Make task manager maxWorkers and pollInterval observables (elastic#75293)

* WIP step 1

* WIP step 2

* Cleanup

* Make maxWorkers an observable for the task pool

* Cleanup

* Fix test failures

* Use BehaviorSubject

* Add some tests

* Make the task manager store emit error events (elastic#75679)

* Add errors$ observable to the task store

* Add unit tests

* Temporarily apply back pressure to maxWorkers and pollInterval when 429 errors occur (elastic#77096)

* WIP

* Cleanup

* Add error count to message

* Reset observable values on stop

* Add comments

* Fix issues when changing configurations

* Cleanup code

* Cleanup pt2

* Some renames

* Fix typecheck

* Use observables to manage throughput

* Rename class

* Switch to createManagedConfiguration

* Add some comments

* Start unit tests

* Add logs

* Fix log level

* Attempt at adding integration tests

* Fix test failures

* Fix timer

* Revert "Fix timer"

This reverts commit 0817e5e.

* Use Symbol

* Fix merge scan

* replace startsWith with a timer that is scheduled to 0

* typo

Co-authored-by: Kibana Machine <[email protected]>
Co-authored-by: Gidi Meir Morris <[email protected]>
mikecote added a commit that referenced this pull request Oct 13, 2020
…ith a 429 (#75666) (#80355)

* Make task manager maxWorkers and pollInterval observables (#75293)

* WIP step 1

* WIP step 2

* Cleanup

* Make maxWorkers an observable for the task pool

* Cleanup

* Fix test failures

* Use BehaviorSubject

* Add some tests

* Make the task manager store emit error events (#75679)

* Add errors$ observable to the task store

* Add unit tests

* Temporarily apply back pressure to maxWorkers and pollInterval when 429 errors occur (#77096)

* WIP

* Cleanup

* Add error count to message

* Reset observable values on stop

* Add comments

* Fix issues when changing configurations

* Cleanup code

* Cleanup pt2

* Some renames

* Fix typecheck

* Use observables to manage throughput

* Rename class

* Switch to createManagedConfiguration

* Add some comments

* Start unit tests

* Add logs

* Fix log level

* Attempt at adding integration tests

* Fix test failures

* Fix timer

* Revert "Fix timer"

This reverts commit 0817e5e.

* Use Symbol

* Fix merge scan

* replace startsWith with a timer that is scheduled to 0

* typo

Co-authored-by: Kibana Machine <[email protected]>
Co-authored-by: Gidi Meir Morris <[email protected]>

Co-authored-by: Kibana Machine <[email protected]>
Co-authored-by: Gidi Meir Morris <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Feature:Task Manager release_note:skip Skip the PR/issue when compiling release notes Team:ResponseOps Platform ResponseOps team (formerly the Cases and Alerting teams) t//

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants