Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ci: retry flaky tests #9508

Closed
wants to merge 6 commits into from
Closed

ci: retry flaky tests #9508

wants to merge 6 commits into from

Conversation

tony19
Copy link
Contributor

@tony19 tony19 commented Aug 3, 2022

related #9492

Description

This updates the CI test jobs to use retry.ts, which reruns the given command (up to 3 times by default) if its output contains one of the errors indicated in #9492. I picked 3 for the default retry count because in my experiments, the same error can appear twice in a row.

The script also adds a warning annotation to the GitHub workflow run to easily identify the flakiness from the Summary view:

Screen Shot 2022-08-03 at 12 45 41 PM

This script can be run locally like this:

# retry 3 times
npx tsx ./scripts/retry.ts pnpm --color=always test

# retry 5 times
npx tsx ./scripts/retry.ts -r 5 pnpm --color=always test

Testing retry.ts

Here's a script that runs pnpm --color=always test-build 10 times in a row, which causes one of the known errors in Node 16:

#!/usr/bin/env bash -e

for i in `seq 10`; do
  echo "************* test run ${i} *************"
  npx tsx ./scripts/retry.ts pnpm --color=always test-build
done

(Tested on macOS Big Sur with Node 16.16.0)

Screenshot:

Screen Shot 2022-08-03 at 12 24 26 AM

Additional context


What is the purpose of this pull request?

  • Bug fix
  • New Feature
  • Documentation update
  • Other

Before submitting the PR, please make sure you do the following

  • Read the Contributing Guidelines.
  • Read the Pull Request Guidelines and follow the Commit Convention.
  • Check that there isn't already a PR that solves the problem the same way to avoid creating a duplicate.
  • Provide a description in this PR that addresses what the PR is solving, or reference the issue that it solves (e.g. fixes #123).
  • Ideally, include relevant tests that fail without this PR but pass with it.

@sapphi-red sapphi-red added p1-chore Doesn't change code behavior (priority) test labels Aug 3, 2022
scripts/retry.sh Outdated Show resolved Hide resolved
scripts/retry.sh Outdated Show resolved Hide resolved
bluwy
bluwy previously approved these changes Aug 6, 2022
Copy link
Member

@bluwy bluwy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great! 💚

patak-dev
patak-dev previously approved these changes Aug 8, 2022
Copy link
Member

@patak-dev patak-dev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great @tony19! I think this is a good idea, at least for the node segfault we can't use something from Vitest side (even if there is a retry mechanism). @antfu @sheremet-va looping you to see if this is a good approach (at least for the moment) for the Vite repo, as I think other projects may end up copying what we do here.

@dominikg maybe we should do the same on Vite ecosystem CI?

@dominikg
Copy link
Contributor

dominikg commented Aug 8, 2022

not a fan of a bash script wrapper in general

  • hard to maintain
  • breaks easily
  • potentially abusable if someone snuck in output that it is looking for

other remarks

  • ci timeout for the job would need to be increased
  • really love the warning annotation, going to add that one to my toolbox 👍
  • instead of working around it, we should put more preassure on upstream to fix it. random crashes are not acceptable

@bluwy bluwy linked an issue Aug 9, 2022 that may be closed by this pull request
7 tasks
@tony19
Copy link
Contributor Author

tony19 commented Aug 9, 2022

@dominikg Thanks for the feedback!

I did actually experiment with a Node/zx script, but the startup speed with the shell script seemed superior to me. But points taken, I can switch from shell script to TypeScript to address your concerns.

I also agree with pushing for an upstream fix, but it doesn't look like the Node team is anywhere close to a fix, and they're not sure the fix is even feasible:

@bnoordhuis commented on Jun 29

For reference: the fix is v8/v8@fcdf35e but that commit can't be back-ported because it sits on top of a bigger, backwards incompatible change, v8/v8@578f6be.

For the time being, an automated retry seems like a reasonable workaround to keep the bots going without babysitting them. I'm completely open to other ideas though!


Update: Fix for Node 16 identified here

@tony19 tony19 dismissed stale reviews from patak-dev and bluwy via 3ad9750 August 9, 2022 05:45
@antfu
Copy link
Member

antfu commented Aug 9, 2022

Looks great!

A side note, I guess it might be good to create a CLI wrapper like vitest-retry for others to reuse. Or even we could have it built-in in Vitest's CLI, (cc @sheremet-va WDYT?)

@sheremet-va
Copy link
Member

Or even we could have it built-in in Vitest's CLI, (cc @sheremet-va WDYT?)

Sure, something like --segfault-retry? We have an issue for .flaky and --retry 3 for regular tests, so I don't want to use it for this case.

@antfu
Copy link
Member

antfu commented Aug 9, 2022

Yeah --segfault-retry sounds good to me. And maybe we could have a default to 3 or so, aware of CI env etc.

@tony19
Copy link
Contributor Author

tony19 commented Aug 9, 2022

@antfu @sheremet-va Yes, please! 😄 A native vitest fix would be so helpful! Let me know if I could help.

@dominikg
Copy link
Contributor

dominikg commented Aug 9, 2022

A flag on vitest bin makes it reusable which is great!

Not sure how i feel about it being enabled by default, tbh i havn't seen this segfault happening in vite-plugin-svelte suite (which is a bit smaller but uses the same setup as vite).

It should be something users conciously opt-in to, so as they are aware of it. If it is enabled and can detect github action event to log the warning card, that would be awesome.

@antfu
Copy link
Member

antfu commented Aug 15, 2022

@tony19
Copy link
Contributor Author

tony19 commented Aug 17, 2022

Node published a v16 fix in 16.17.0. I verified using that version eliminates the segfaults in CI.

@tony19 tony19 deleted the fix/retry-flaky-tests-in-ci branch October 23, 2022 16:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
p1-chore Doesn't change code behavior (priority) test
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Flaky tests
7 participants