This repo is used for tracking flaky tests on the Node.js CI and fixing them.
Current status: work in progress. Please go to the issue tracker to discuss!
Updates should be merged as soon as possible. We can revert or modify afterwards. This repo is mostly for coordination so we need to move fast and reduce the noise.
The goal is to make the CI green again:
- Taking actual failures from PRs into account, at least 80% of the node-test-pull-request (or node-test-commit) CI runs should be green.
- At least 90% of the node-daily-master CI should be green.
- A green CI run is a run with a SUCCESS status; UNSTABLE does not count as green.
- Taking the last 100 runs, at any given time the green rate is calculated as `SUCCESS / (100 - RUNNING - ABORTED)`.
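The green-rate formula above can be sketched in a few lines of Python (the function and variable names here are illustrative, not from any real tool):

```python
# Illustrative sketch of the green-rate calculation described above.
from collections import Counter

def green_rate(statuses):
    """Green rate over the last 100 runs: only SUCCESS counts as green,
    and RUNNING/ABORTED runs are excluded from the denominator."""
    counts = Counter(statuses)
    runnable = len(statuses) - counts["RUNNING"] - counts["ABORTED"]
    return counts["SUCCESS"] / runnable

# Example: 80 green, 10 unstable, 5 failed, 3 running, 2 aborted out of 100.
statuses = (["SUCCESS"] * 80 + ["UNSTABLE"] * 10 + ["FAILURE"] * 5
            + ["RUNNING"] * 3 + ["ABORTED"] * 2)
print(f"{green_rate(statuses):.2%}")  # 80 / 95, i.e. 84.21%
```

Note that UNSTABLE runs stay in the denominator, which is why they drag the green rate down.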
A GitHub workflow runs every day to produce reliability reports of the node-test-pull-request CI and posts them to the issue tracker.
Most work starts with opening the issue tracker of this repository and reading the latest report. If the report is missing, see the actions page for details. GitHub's API restricts the length of issue messages, so when a report is too long the workflow can fail to post it as an issue, but it should still leave a summary on the actions page.
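Since the reports are posted as issues, the newest one can also be located programmatically through the public GitHub REST API. A minimal sketch using only the standard library (the helper names below are made up for illustration):

```python
# Hypothetical sketch: locate the newest reliability report via the
# public GitHub REST API. Reports are opened as issues, so the most
# recently created open issue is the latest report.
import json
import urllib.request

def report_query_url(repo="nodejs/reliability"):
    # Ask for the single newest open issue in the repo.
    return (f"https://api.github.com/repos/{repo}/issues"
            "?state=open&sort=created&direction=desc&per_page=1")

def latest_report_title(repo="nodejs/reliability"):
    # Makes a network call; returns the title of the newest open issue.
    with urllib.request.urlopen(report_query_url(repo)) as resp:
        issues = json.load(resp)
    return issues[0]["title"] if issues else None
```

This assumes unauthenticated access is sufficient; heavy use would need an API token to avoid rate limits.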
- Check out the JSTest Failure section of the latest reliability report. It contains information about the JS tests that failed in more than one pull request in the last 100 node-test-pull-request CI runs. The more pull requests a test fails, the higher it is ranked, and the more likely it is a flake.
- Search the name of the test in the Node.js issue tracker and see if there is already an issue about it. If there is, check whether the failures are similar, and comment with updates if necessary.
- If the flake isn't already tracked by an issue, continue to look into it. In the report of a JS test, check out the pull requests in which it failed and see if there is a connection. If the pull requests appear to be unrelated, it is more likely that the test is a flake.
- Search the historical reliability reports with the name of the test in the reliability issue tracker, and see how long the flake has been showing up. Gather information from the historical reports, and open an issue in the Node.js issue tracker to track the flake.
- If the flake only started to show up in the recent month, check the historical reports to see precisely when it first appeared. Look at commits landing on the target branch around the same time using https://github.com/nodejs/node/commits?since=YYYY-MM-DD and see if there is any pull request that looks related. If one or more related pull requests can be found, ping the author or the reviewers of the pull request, or the team in charge of the related subsystem, in the tracking issue or in private, to see if they can come up with a fix to deflake the test.
- If the test has been flaky for more than a month and no one is actively working on it, it is unlikely to go away on its own, and it's time to mark it as flaky. For example, if parallel/some-flaky-test.js has been flaky on Windows in the CI, after making sure that there is an issue tracking it, open a pull request to add the following entry to test/parallel/parallel.status:

      [$system==win32]
      # https://github.com/nodejs/node/issues/<TRACKING_ISSUE_ID>
      some-flaky-test: PASS,FLAKY
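To illustrate the shape of such an entry, here is a minimal, simplified parser for the `.status` format shown above. This is an illustrative sketch only, not the real parser used by the Node.js test runner, which handles more syntax:

```python
# Simplified sketch of how a test/*/*.status entry is structured:
# a [condition] section header followed by "test-name: OUTCOME,OUTCOME"
# lines, with "#" starting a comment.
def parse_status(text):
    """Parse a simplified .status file into {section: {test: outcomes}}."""
    sections = {}
    current = None
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]          # e.g. "$system==win32"
            sections[current] = {}
        else:
            name, _, outcomes = line.partition(":")
            sections[current][name.strip()] = [
                o.strip() for o in outcomes.split(",")]
    return sections

# Entry shaped like the example above (12345 is an illustrative issue id).
entry = """
[$system==win32]
# https://github.com/nodejs/node/issues/12345
some-flaky-test: PASS,FLAKY
"""
print(parse_status(entry))
# {'$system==win32': {'some-flaky-test': ['PASS', 'FLAKY']}}
```

The `PASS,FLAKY` outcome list is what tells the runner not to turn the whole CI run red when the test fails.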
In the reliability reports, Jenkins Failure, Git Failure and Build Failure are generally infrastructure issues and can be handled by the nodejs/build team. Typical infrastructure issues include:
- The CI machine has trouble pulling source code from the repository
- The CI machine has trouble communicating with the Jenkins server
- Builds timing out
- The parent job fails to trigger sub builds
Sometimes infrastructure issues can show up in the tests too; for example, tests can fail with ENOSPC (no space left on device), and the machine needs to be cleaned up to release disk space.
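As a small illustration of the ENOSPC case, free disk space can be checked with the standard library alone (the function name and threshold below are made up for this sketch):

```python
# Illustrative sketch: spot looming ENOSPC ("no space left on device")
# failures by checking free disk space before it runs out.
import shutil

def disk_usage_report(path="/", threshold_gib=5):
    """Return (free space in GiB, True if below the cleanup threshold)."""
    usage = shutil.disk_usage(path)
    free_gib = usage.free / 2**30
    return free_gib, free_gib < threshold_gib

free, low = disk_usage_report(".")
print(f"{free:.1f} GiB free; needs cleanup: {low}")
```

In practice the build team would run something like this (or plain `df`) on the affected machine itself.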
Some infrastructure issues go away on their own, but if the same kind of infrastructure issue has been failing multiple pull requests and persists for more than a day, it's time to take action.
Check out the Node.js build issue tracker to see if there is an open issue about it. If there isn't, open a new issue or ask around in the #nodejs-build channel in the OpenJS Slack.
When reporting infrastructure issues, it's important to include information about the particular machines where the issues happen. On the Jenkins page of the failed CI build where the infrastructure issue is reported in the logs (not to be confused with the parent build that triggers the sub builds), in the top-right corner there is normally a line similar to `Took 16 sec on test-equinix-ubuntu2004_container-armv7l-1`. In this case, test-equinix-ubuntu2004_container-armv7l-1 is the machine having the infrastructure issue, and it's important to include this information in the report.
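Pulling the machine name out of that line can be automated with a small regular expression. A hedged sketch (the exact wording of the Jenkins line may vary between job types):

```python
# Sketch: extract the machine name from the "Took ... on <machine>" line
# shown on a Jenkins build page, for pasting into an infrastructure report.
import re

# Machine names use letters, digits, underscores, dots and hyphens.
TOOK_RE = re.compile(r"Took .+? on (?P<machine>[\w.-]+)")

def machine_from_took_line(text):
    match = TOOK_RE.search(text)
    return match.group("machine") if match else None

print(machine_from_took_line(
    "Took 16 sec on test-equinix-ubuntu2004_container-armv7l-1"))
# test-equinix-ubuntu2004_container-armv7l-1
```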
- Read the flake database in ncu-ci so people can quickly tell if a failure is a flake
- Automate the report process in ncu-ci
- Migrate existing issues in nodejs/node and nodejs/build, and close outdated ones.