Improve supervisor restart calculation #8261

Merged

Conversation

Maria-12648430
Contributor

@Maria-12648430 Maria-12648430 commented Mar 13, 2024

Before restarting a child, a supervisor must check if the restart limit is reached. This adds a penalty to the overall restart time, which should be kept low.

The current implementation does this check by traversing the list of restarts to filter out those that have expired. It then traverses the resulting list again via length to check whether it exceeds the intensity limit. That is two O(n) passes (still O(n) overall), with n being the number of past restarts within the period.
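
For illustration, the two-pass check described above can be sketched roughly as follows. This is a hypothetical, simplified module and not the actual supervisor source: the module name, the function name, and the representation of restarts as a newest-first list of timestamps in seconds are all assumptions.

```erlang
-module(restart_two_pass).
-export([check/4]).

%% Hypothetical sketch, not the OTP source.
%% Restarts: past restart timestamps in seconds, newest first (assumed).
%% Returns {ok, Kept} if the restart is allowed, {terminate, Kept} otherwise.
check(Restarts, Now, Period, Intensity) ->
    %% Pass 1: record the new restart and drop entries older than Period.
    Kept = [T || T <- [Now | Restarts], Now - T =< Period],
    %% Pass 2: length/1 walks the filtered list a second time.
    case length(Kept) =< Intensity of
        true -> {ok, Kept};
        false -> {terminate, Kept}
    end.
```

With an intensity of 2 and a period of 5, `restart_two_pass:check([9, 1], 10, 5, 2)` keeps the restarts at 10 and 9, drops the expired one at 1, and allows the restart.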

This PR introduces two optimizations:

  • it checks whether the restart limit is reached while it is traversing the restart list in order to remove expired restarts, thereby eliminating the need for an additional traversal via the call to length. Depending on the outcome, a restart is either allowed or disallowed. This behavior is O(n).

  • it sidesteps the need to perform the step above by keeping a separate counter for restarts; as long as that counter is below the intensity value, it is safe to allow the restart, add the restart to the list, and increment the counter. This behavior is O(1).

    Only when the counter reaches the intensity limit must the actual number of restarts within the given period be calculated via the step above; if the restart is allowed, the restart list is updated and the counter is set to the corresponding value.

    (Over time, this may lead to a large list of accumulated expired restarts being carried around. For this reason, the counter is limited not by the intensity value alone but by the minimum of the intensity value and a hardcoded limit. By gut feeling, I picked 1000.)
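
The two optimizations above could be sketched along these lines. This is a minimal illustration under stated assumptions, not the code in this PR: the module, the `rs` record, its field names, and the `MAX_UNCHECKED` macro are all hypothetical, and restarts are assumed to be stored newest first, so the pruning pass can stop at the first expired entry.

```erlang
-module(restart_counter).
-export([new/2, add_restart/2]).

%% Hypothetical cap on unchecked restarts (the PR description mentions 1000).
-define(MAX_UNCHECKED, 1000).

%% Hypothetical state; not the real supervisor state record.
-record(rs, {intensity, period, restarts = [], count = 0}).

new(Intensity, Period) ->
    #rs{intensity = Intensity, period = Period}.

%% Fast path, O(1): the counter is below both the intensity and the cap,
%% so even if every stored restart were still live, one more is allowed.
add_restart(Now, #rs{intensity = I, restarts = Rs, count = C} = S)
  when C < I, C < ?MAX_UNCHECKED ->
    {ok, S#rs{restarts = [Now | Rs], count = C + 1}};
%% Slow path, O(n): prune expired restarts and check the limit in one pass.
add_restart(Now, #rs{intensity = I, period = P, restarts = Rs} = S) ->
    case prune(Rs, Now, P, I, 0, []) of
        {ok, Kept, N} ->
            {ok, S#rs{restarts = [Now | Kept], count = N + 1}};
        terminate ->
            terminate
    end.

%% N counts the live restarts seen so far. Bail out as soon as adding the
%% new restart would exceed the intensity; stop at the first expired entry.
prune(_, _Now, _P, I, N, _Acc) when N >= I ->
    terminate;
prune([T | Rest], Now, P, I, N, Acc) when Now - T =< P ->
    prune(Rest, Now, P, I, N + 1, [T | Acc]);
prune(_, _Now, _P, _I, N, Acc) ->
    {ok, lists:reverse(Acc), N}.
```

In this sketch the counter is an upper bound on the number of live restarts, which is why the fast path is safe: the list may contain expired entries, but never more live ones than the counter says.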

@Maria-12648430 Maria-12648430 marked this pull request as ready for review March 13, 2024 16:03
Contributor

github-actions bot commented Mar 13, 2024

CT Test Results

    2 files     94 suites   34m 30s ⏱️
2 042 tests 1 994 ✅ 48 💤 0 ❌
2 351 runs  2 301 ✅ 50 💤 0 ❌

Results for commit 6b31837.

♻️ This comment has been updated with latest results.

To speed up review, make sure that you have read Contributing to Erlang/OTP and that all checks pass.

See the TESTING and DEVELOPMENT HowTo guides for details about how to run test locally.

Artifacts

// Erlang/OTP Github Action Bot

@rickard-green rickard-green added the team:PS Assigned to OTP team PS label Mar 18, 2024
@IngelaAndin IngelaAndin added the stalled waiting for input by the Erlang/OTP team label Mar 20, 2024
@IngelaAndin IngelaAndin self-assigned this Mar 20, 2024
@IngelaAndin IngelaAndin added this to the OTP-28.0 milestone Mar 20, 2024
@IngelaAndin
Contributor

New optimizations are way too dangerous to include so close to the release; you never know what timing bugs might be revealed, so we are postponing this for OTP-28.

@IngelaAndin IngelaAndin added testing currently being tested, tag is used by OTP internal CI and removed stalled waiting for input by the Erlang/OTP team labels Aug 19, 2024
@IngelaAndin
Contributor

Looks good in the tests; that is, I cannot see any new test failures that seem related to this. I will merge for OTP-28 so that we have plenty of time to test it further. I guess most people other than us, however, will not test it until the first release candidate is out.

@IngelaAndin IngelaAndin merged commit 494647a into erlang:master Aug 23, 2024
17 checks passed
@IngelaAndin
Contributor

@Maria-12648430 seems our systems were lagging a little :( Now I see failures for two test cases that are caused by this PR.

The error manifests as follows in the SASL application test suite:


=== Location: [{release_handler_SUITE,upgrade_supervisor,1664},
              {test_server,ts_tc,1794},
              {test_server,run_test_case_eval1,1303},
              {test_server,run_test_case_eval,1235}]
=== Reason: no match of right hand side value 
                 {state,{local,a_sup},
                        one_for_all,
                        {[a],
                         #{a =>
                               {child,<20453.100.0>,a,
                                      {a,start_link,[]},
                                      permanent,false,brutal_kill,worker,
                                      [a]}}},
                        undefined,4,3600,[],0,0,never,a_sup,[]}
  in function  release_handler_SUITE:upgrade_supervisor/1 (release_handler_SUITE.erl, line 1664)
  in call from test_server:ts_tc/3 (test_server.erl, line 1794)
  in call from test_server:run_test_case_eval1/6 (test_server.erl, line 1303)
  in call from test_server:run_test_case_eval/9 (test_server.erl, line 1235)

Could you start by seeing if you can recreate it in your test environment?

@Maria-12648430
Contributor Author

Hi @IngelaAndin, I think I can look at it tomorrow or the day after; right now I'm on my mobile, i.e., not anywhere near a real computer ;) But it looks like the test tries to match on the supervisor state record, which got a new field with this PR.

@Maria-12648430
Contributor Author

Maria-12648430 commented Aug 27, 2024

Ah, yes, this is the failing line:

{state,_,RestartStrategy,{[a],Db},_,_,_,_,_,_,_,_} = State,

Just adding another _ to the tuple should do the trick then.
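
To illustrate why such a match breaks: an Erlang record is a tagged tuple at runtime, so a pattern that spells out the tuple positionally only matches one specific arity. The following is a small hypothetical sketch with made-up, shortened state tuples (not the real supervisor state) showing a fixed-arity pattern failing once a field is added.

```erlang
-module(state_match).
-export([demo/0]).

%% Hypothetical stand-ins for a state tuple before and after a field is added.
demo() ->
    Old = {state, a_sup, one_for_all, []},
    New = {state, a_sup, one_for_all, [], 0},   %% one extra field
    {strategy(Old), strategy(New)}.

%% The pattern hard-codes a 4-element tuple, so it cannot match the
%% 5-element version; the second clause stands in for the badmatch.
strategy({state, _, Strategy, _}) -> {ok, Strategy};
strategy(_) -> nomatch.
```

Calling `state_match:demo()` returns `{{ok,one_for_all},nomatch}`. A test that must inspect another module's state without access to its record definition has to keep such positional patterns in sync with the record, which is what the extra `_` does here.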

@Maria-12648430
Contributor Author

@IngelaAndin ok, I got it sorted out. But you said two test cases were failing? What is the other one?

@IngelaAndin
Contributor

@Maria-12648430 Well, the same test is run in two groups (relative and absolute), so your fix should fix both tests, as they are the same in this respect. Could you please make a PR that updates the test suite?

Maria-12648430 added a commit to Maria-12648430/otp that referenced this pull request Aug 28, 2024
@Maria-12648430
Contributor Author

Could you please make a PR that updates the test suite?

Sure: #8758 🙂

IngelaAndin added a commit that referenced this pull request Aug 29, 2024
…ade_test

Fix SASL test failing after #8261 was merged
Labels
team:PS Assigned to OTP team PS testing currently being tested, tag is used by OTP internal CI

3 participants