Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ZTS: Add timeout to cp_stress #16298

Closed
wants to merge 1 commit into from

Conversation

tonyhutter
Copy link
Contributor

Motivation and Context

cp_stress ZTS test can timeout after 10min on underpowered test systems

Description

cp_stress is getting killed on the new QEMU-based github runners we're developing. The problem is that the QEMU-based testers are so under-powered that the test is taking longer than the 10min maximum that ZTS enforces. Instead, enforce an inter-test-cp_stress timeout so the entire test doesn't get killed.

How Has This Been Tested?

Ran on FreeBSD 13 and Ubuntu 24.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

Copy link
Member

@robn robn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks right for what it is, but I wonder: what kind of "underpowered" are we talking here? Because seekflood is more likely to hit things if there are more CPUs. So if there's only one or two here, is it even worth running this?

@tonyhutter
Copy link
Contributor Author

Looks right for what it is, but I wonder: what kind of "underpowered" are we talking here? Because seekflood is more likely to hit things if there are more CPUs. So if there's only one or two here, is it even worth running this?

A little background: I'm currently helping test #15838. The idea there is to convert our buildbot builders to github runners. Or more specifically, run ZTS within VMs on the github runners (using the runner as a VM host). We need to use VMs, since github runners don't natively support all the OSs we test on (like Fedora, Almalinux, FreeBSD, etc). In order to speed up the testing and not run into the 4-hour testing timeout, we're chunking up the the tests into thirds and running it on three VMs running in parallel. That unfortunately means each each VM is going over-provisioned and run slower. That's fine for most ZTS tests since they aren't very CPU bound, but there are some outliers like cp_stress that do run too long and get killed.

So yes, we ideally would want faster runners with more CPUs since they're going to expose race conditions and locking bugs like this quicker. But we also have to make due with what's available.

@robn
Copy link
Member

robn commented Jun 25, 2024

Oh I'm sorry, my question wasn't clear (also it was more of a drive-by mumble than a really big-brained thought).

I more meant, if we know this test is running in an environment where it's unlikely to actually bump into the thing it's checking for (because it isn't parallel enough), should we disable the test in that environment. Like, should it SKIP if it has fewer than N cores or something?

I suppose even "unlikely" is better than "never" though!

@tonyhutter
Copy link
Contributor Author

tonyhutter commented Jun 25, 2024

I think it's still a useful test to run no matter what the hardware is. This popped up today: #16297

@robn
Copy link
Member

robn commented Jun 25, 2024

Oof, hard to argue with results. Also yikes, I'll check it out! 😅

@tonyhutter tonyhutter added the Status: Code Review Needed Ready for review and testing label Jul 12, 2024
cp_stress is getting killed on the new QEMU-based github runners we're
developing.  The problem is that the QEMU-based testers are so
underpowered that the test is taking longer than the 10min maximum
that ZTS enforces.  Instead, enforce an inter-test-cp_stress timeout
so the entire test doesn't get killed.

Signed-off-by: Tony Hutter <[email protected]>
@mcmilk
Copy link
Contributor

mcmilk commented Jul 20, 2024

I created another pull request for this problem here: #16369

This one just takes down the RUNS to the limit of the FreeBSD testings, and then the test work fine on Linux also.
The cp_stress test needs around 2 minutes on FreeBSD and Linux then...

@mcmilk
Copy link
Contributor

mcmilk commented Jul 22, 2024

@tonyhutter - can be closed now 👍

@tonyhutter
Copy link
Contributor Author

Closing in favor of #16369

@tonyhutter tonyhutter closed this Jul 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Code Review Needed Ready for review and testing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants