-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve supervisor restart calculation #8261
Improve supervisor restart calculation #8261
Conversation
CT Test Results 2 files 94 suites 34m 30s ⏱️ Results for commit 6b31837. ♻️ This comment has been updated with latest results. To speed up review, make sure that you have read Contributing to Erlang/OTP and that all checks pass. See the TESTING and DEVELOPMENT HowTo guides for details about how to run test locally. Artifacts// Erlang/OTP Github Action Bot |
New optimizations are way to dangerous to include so close to the release, you never know what timing bugs that might be revealed, so we are postponing this for OTP-28. |
Looks good in the test, that is I can not see any new test failures that seems related to this. I will merge for OTP-28 so that we have plenty of time to test it further. I guess most other people, the us, will however not test it until the first release candidate is out. |
@Maria-12648430 seems our systems where lagging a little :( Now I see failures for two test cases that are caused by this PR. The error I get manifests as following in the application SASL test suite:
Could you start by seeing I you can recreate it in your test environment ? |
Hi @IngelaAndin, I think I can look at it tomorrow or the day after, right now I'm on my mobile, ie not anywhere near a real computer ;) But it looks like the test tries to match on the supervisor state record, which got a new field with this PR. |
Ah, yes, this is the failing line: otp/lib/sasl/test/release_handler_SUITE.erl Line 1664 in 695c729
Just adding another _ to the tuple should do the trick then.
|
@IngelaAndin ok, I got it sorted out. But you said two test cases were failing? What is the other one? |
@Maria-12648430 Well same test is run in two groups (relative and absolute), so your fix should fix both tests as they are the same in this respect. Could you please make a PR that updates the test suite? |
Sure: #8758 🙂 |
…ade_test Fix SASL test failing after #8261 was merged
Before restarting a child, a supervisor must check if the restart limit is reached. This adds a penalty to the overall restart time, which should be kept low.
The current implementation does this check by traversing the list of restarts in order to filter out those that have expired. Then it essentially traverses the result list via
length
in order to check if it is over the intensity limit. This behavior is2*O(n)
(?), withn
being the number of past restarts within the period.This PR introduces two optimizations:
it checks whether the restart limit is reached while it is traversing the restart list in order to remove expired restarts, thereby eliminating the need for an additional traversal via the call to
length
. Depending on the outcome, a restart is either allowed or disallowed. This behavior isO(n)
.it sidesteps the need to perform the step above by keeping a separate counter for restarts; as long as that counter is below the intensity value, it is safe to allow the restart, add the restart to the list, and increment the counter. This behavior is
O(1)
.Only when the counter reaches the intensity limit, the actual number of restarts within the given period must be calculated via the step above; if the restart is allowed, the restart list is updated and the counter set to the according value.
(Over time, this may lead to a large list of accumulated expired restarts being carried around. For this reason, the counter is limited not by the intensity value alone but rather by the minimum of the intensity value and a hardcoded limit. By gut feeling, I picked 1000)