Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test: fix races in test-performance-eventlooputil #36028

Closed
wants to merge 1 commit into from

Conversation

Flarna
Copy link
Member

@Flarna Flarna commented Nov 8, 2020

Fix two races in test-performance-eventlooputil resulting in a flaky test.

elu1 was capture after start time t from spin look. If OS descides to reschedule the process after capturing t but before getting elu for >=50ms the spin loop is actually a nop. elu1 doesn't show this and as a result elut3 = eventLoopUtilization(elu1) results in
elu3.active is lower then the expected spin time.
Moving capturing of t after capturing t, just before the spin look
avoids this.

Similar if OS decides to shedule a different process between getting the total elu from start and the diff elu showing the spin loop the check to verify that total active time is long then the spin loop
fails.
Exchanging these statements avoids this race.

Fixes: #35309

fyi @trevnorris

Fix two races in test-performance-eventlooputil resulting in a flaky
test.

elu1 was capture after start time t from spin look. If OS descides to
reschedule the process after capturing t but before getting elu for
>=50ms the spin loop is actually a nop. elu1 doesn't show this and as
a result elut3 = eventLoopUtilization(elu1) results in
elu3.active === 0.
Moving capturing of t after capturing t, just before the spin look
avoids this.

Similar if OS decides to shedule a different process between getting
the total elu from start and the diff elu showing the spin loop the
check to verify that total active time is long then the spin loop
fails.
Exchanging these statements avoids this race.
@nodejs-github-bot nodejs-github-bot added the test Issues and PRs related to the tests. label Nov 8, 2020
@Flarna Flarna added the request-ci Add this label to start a Jenkins CI on a PR. label Nov 8, 2020
@github-actions github-actions bot removed the request-ci Add this label to start a Jenkins CI on a PR. label Nov 8, 2020
@nodejs-github-bot
Copy link
Collaborator

CI: https://ci.nodejs.org/job/node-test-pull-request/34186/

@Flarna Flarna added the author ready PRs that have at least one approval, no pending requests for changes, and a CI started. label Nov 8, 2020
@Trott
Copy link
Member

Trott commented Nov 9, 2020

Trying to start a stress test for this but CI stress test is not starting for some reason. Anyway, a similar stress test a couple days ago against master showed 12 failures on osx1015 in 1000 runs. So hopefully we can run that on this PR, see that it fixes things, and I can approve this and close my PR to move this test to sequential.

@Flarna
Copy link
Member Author

Flarna commented Nov 9, 2020

@Trott Please note that you refer to a different test (test-worker-eventlooputil) - this PR modifies test-performance-eventlooputil

@Flarna
Copy link
Member Author

Flarna commented Nov 9, 2020

Copy link
Contributor

@trevnorris trevnorris left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm working on getting the test as it is on my machine to fail from a stress test, but doesn't want to.

Mind checking if moving setting elu1, but not changing anything else, fixes the issue?

@@ -23,28 +23,28 @@ if (nodeTiming.loopStart === -1) {
}

setTimeout(mustCall(function r() {
const t = Date.now();
const elu1 = eventLoopUtilization();
Copy link
Contributor

@trevnorris trevnorris Nov 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I messed up here. elu1 also shouldn't be captured until after some idleTime has accumulated. This function call was only meant supposed to check if .active was being set correctly.

The reason for forcing idleTime to accumulate before the test ran was so it would execute the same running it directly or running with ./tools/test.py --worker.

@Flarna
Copy link
Member Author

Flarna commented Nov 10, 2020

Mind checking if moving setting elu1, but not changing anything else, fixes the issue?

As far as I remember I tried this but then the second race mentioned in description jumped in and resulted in a sporadic fail in the next assert.
The easiest way to get the test failing is to put load on your machine. Start the load generators you like with same prio then the node processes.
To my understaning the elu active time grows also if the process is not scheduled at all as OS assignes no CPU. So another option is to fake this by adding a busy loop between calc of elu2 and elu3 which is the same from elu point of view as not getting a CPU.

@trevnorris
Copy link
Contributor

@Flarna Thanks. Was able to reproduce. See what you're saying. I'm testing out some variations w/ you PR. Will get back in a bit.

@nodejs nodejs deleted a comment from Flarna Nov 10, 2020
Copy link
Contributor

@trevnorris trevnorris left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Flarna Thanks for the fix. Looks good.

@Trott
Copy link
Member

Trott commented Nov 10, 2020

Landed in 135dd88. Thanks!

@Trott Trott closed this Nov 10, 2020
Trott pushed a commit to Trott/io.js that referenced this pull request Nov 10, 2020
Fix two races in test-performance-eventlooputil resulting in a flaky
test.

elu1 was capture after start time t from spin look. If OS descides to
reschedule the process after capturing t but before getting elu for
>=50ms the spin loop is actually a nop. elu1 doesn't show this and as
a result elut3 = eventLoopUtilization(elu1) results in
elu3.active === 0.
Moving capturing of t after capturing t, just before the spin look
avoids this.

Similar if OS decides to shedule a different process between getting
the total elu from start and the diff elu showing the spin loop the
check to verify that total active time is long then the spin loop
fails.
Exchanging these statements avoids this race.

PR-URL: nodejs#36028
Fixes: nodejs#35309
Reviewed-By: Benjamin Gruenbaum <[email protected]>
Reviewed-By: Rich Trott <[email protected]>
@Flarna Flarna deleted the test-elu-races branch November 10, 2020 14:01
danielleadams pushed a commit that referenced this pull request Nov 10, 2020
Fix two races in test-performance-eventlooputil resulting in a flaky
test.

elu1 was capture after start time t from spin look. If OS descides to
reschedule the process after capturing t but before getting elu for
>=50ms the spin loop is actually a nop. elu1 doesn't show this and as
a result elut3 = eventLoopUtilization(elu1) results in
elu3.active === 0.
Moving capturing of t after capturing t, just before the spin look
avoids this.

Similar if OS decides to shedule a different process between getting
the total elu from start and the diff elu showing the spin loop the
check to verify that total active time is long then the spin loop
fails.
Exchanging these statements avoids this race.

PR-URL: #36028
Fixes: #35309
Reviewed-By: Benjamin Gruenbaum <[email protected]>
Reviewed-By: Rich Trott <[email protected]>
@danielleadams danielleadams mentioned this pull request Nov 10, 2020
richardlau pushed a commit that referenced this pull request Mar 29, 2021
Fix two races in test-performance-eventlooputil resulting in a flaky
test.

elu1 was capture after start time t from spin look. If OS descides to
reschedule the process after capturing t but before getting elu for
>=50ms the spin loop is actually a nop. elu1 doesn't show this and as
a result elut3 = eventLoopUtilization(elu1) results in
elu3.active === 0.
Moving capturing of t after capturing t, just before the spin look
avoids this.

Similar if OS decides to shedule a different process between getting
the total elu from start and the diff elu showing the spin loop the
check to verify that total active time is long then the spin loop
fails.
Exchanging these statements avoids this race.

PR-URL: #36028
Fixes: #35309
Reviewed-By: Benjamin Gruenbaum <[email protected]>
Reviewed-By: Rich Trott <[email protected]>
@richardlau richardlau mentioned this pull request Mar 29, 2021
targos pushed a commit that referenced this pull request May 1, 2021
Fix two races in test-performance-eventlooputil resulting in a flaky
test.

elu1 was capture after start time t from spin look. If OS descides to
reschedule the process after capturing t but before getting elu for
>=50ms the spin loop is actually a nop. elu1 doesn't show this and as
a result elut3 = eventLoopUtilization(elu1) results in
elu3.active === 0.
Moving capturing of t after capturing t, just before the spin look
avoids this.

Similar if OS decides to shedule a different process between getting
the total elu from start and the diff elu showing the spin loop the
check to verify that total active time is long then the spin loop
fails.
Exchanging these statements avoids this race.

PR-URL: #36028
Fixes: #35309
Reviewed-By: Benjamin Gruenbaum <[email protected]>
Reviewed-By: Rich Trott <[email protected]>
@danielleadams danielleadams mentioned this pull request May 3, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
author ready PRs that have at least one approval, no pending requests for changes, and a CI started. test Issues and PRs related to the tests.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

investigate flaky test-performance-eventlooputil
6 participants