Skip to content

nixos/test-driver: provide a global timeout#262839

Merged
RaitoBezarius merged 4 commits intoNixOS:masterfrom
RaitoBezarius:qemu-vm/timeout
Oct 29, 2023
Merged

nixos/test-driver: provide a global timeout#262839
RaitoBezarius merged 4 commits intoNixOS:masterfrom
RaitoBezarius:qemu-vm/timeout

Conversation

@RaitoBezarius
Copy link
Member

@RaitoBezarius RaitoBezarius commented Oct 23, 2023

Description of changes

Since the debut of the test-driver, we didn't obtain a race timer with the test execution to ensure that tests doesn't run beyond a certain amount of time.

This is particularly important when you are running into hanging tests which cannot be detected by current facilities (requires more pvpanic wiring up, QMP API stuff, etc.).

Two easy examples:

  • Some QEMU tests may get stuck in some situation and run for more than 24 hours → we default to 24 hours max.
  • Some QEMU tests may panic in the wrong place, e.g. UEFI firmware or worse → end users can set a "reasonable" amount of time

And then, we should let the retry logic retest them until they succeed and adjust their global timeouts.

Of course, this does not help with the fact that the timeout may need to be a function of the actual busyness of the machine running the tests. This is only one step towards increased reliability.

Self tests are offered via testBuildFailure.

Things done

  • Built on platform(s)
    • x86_64-linux
    • aarch64-linux
    • x86_64-darwin
    • aarch64-darwin
  • For non-Linux: Is sandboxing enabled in nix.conf? (See Nix manual)
    • sandbox = relaxed
    • sandbox = true
  • Tested, as applicable:
  • Tested compilation of all packages that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review rev HEAD". Note: all changes have to be committed, also see nixpkgs-review usage
  • Tested basic functionality of all binary files (usually in ./result/bin/)
  • 23.11 Release Notes (or backporting 23.05 Release notes)
    • (Package updates) Added a release notes entry if the change is major or breaking
    • (Module updates) Added a release notes entry if the change is significant
    • (Module addition) Added a release notes entry if adding a new NixOS module
  • Fits CONTRIBUTING.md.

@RaitoBezarius RaitoBezarius requested a review from tfc as a code owner October 23, 2023 00:11
@github-actions github-actions bot added the 6.topic: nixos Issues or PRs affecting NixOS modules, or package usability issues specific to NixOS label Oct 23, 2023
@RaitoBezarius RaitoBezarius requested a review from roberth October 23, 2023 00:12
@roberth
Copy link
Member

roberth commented Oct 23, 2023

👍 this is more robust than meta.timeout, e.g. when tests are accumulated with symlinkJoin or some reporting derivation.
Would love to use this in Nix itself to report which tests have run in which test environments. That's currently far too opaque.

TODO: self tests?

You could write a test with a short timeout and a long sleep somewhere, then transform it with testers.testBuildFailure, which produces a log in its $out.
You can then grep the log from a runCommand to make sure it failed in the right way (ie not due to an unrelated error).

@ofborg ofborg bot added 10.rebuild-darwin: 0 This PR does not cause any packages to rebuild on Darwin. 10.rebuild-linux: 1-10 This PR causes between 1 and 10 packages to rebuild on Linux. labels Oct 23, 2023
@tfc
Copy link
Contributor

tfc commented Oct 23, 2023

General timeouts are useful and generally an improvement for CI throughput.

I have two thoughts about this change:

For how many people is a test without a timeout useful at all? In interactive mode (which could run for days for playing with), the timer is not started, which makes sense.
As much as i like "maybe" values, we could get rid of this complexity by simply always having a timeout. And people who want no timeout can still simulate it by setting it to a week or so (and interactive users don't need to touch it at all).

The default value of 24 hours is already very relaxed. What is our knowledge about the longest legitimately running test? It might make sense to already use this CI-throughput-improving functionality by setting a useful halfway strict global timeout (as it still can be set to longer times for tests which really need it).

@RaitoBezarius
Copy link
Member Author

General timeouts are useful and generally an improvement for CI throughput.

I have two thoughts about this change:

For how many people is a test without a timeout useful at all? In interactive mode (which could run for days for playing with), the timer is not started, which makes sense.
As much as i like "maybe" values, we could get rid of this complexity by simply always having a timeout. And people who want no timeout can still simulate it by setting it to a week or so (and interactive users don't need to touch it at all).

Sounds reasonable!

The default value of 24 hours is already very relaxed. What is our knowledge about the longest legitimately running test? It might make sense to already use this CI-throughput-improving functionality by setting a useful halfway strict global timeout (as it still can be set to longer times for tests which really need it).

Discussed it on infra and 1 hour seems better.

@RaitoBezarius RaitoBezarius requested a review from nikstur October 23, 2023 10:02
Copy link
Contributor

@nikstur nikstur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very very cool

Copy link
Contributor

@blitz blitz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I love it! ❤️

@RaitoBezarius RaitoBezarius force-pushed the qemu-vm/timeout branch 3 times, most recently from df669ce to c144828 Compare October 23, 2023 12:14
@RaitoBezarius
Copy link
Member Author

For self testing, I am reviving #157161 and plan to use it as a base for the self test here.

@RaitoBezarius
Copy link
Member Author

@roberth Done for everything in this scope. I would suggest to get this first probably. :)

Since the debut of the test-driver, we didn't obtain
a race timer with the test execution to ensure that tests doesn't run beyond
a certain amount of time.

This is particularly important when you are running into hanging tests
which cannot be detected by current facilities (requires more pvpanic wiring up, QMP
API stuff, etc.).

Two easy examples:

- Some QEMU tests may get stuck in some situation and run for more than 24 hours → we default to 1 hour max.
- Some QEMU tests may panic in the wrong place, e.g. UEFI firmware or worse → end users can set a "reasonable" amount of time

And then, we should let the retry logic retest them until they succeed and adjust
their global timeouts.

Of course, this does not help with the fact that the timeout may need to be
a function of the actual busyness of the machine running the tests.
This is only one step towards increased reliability.
For `testBuildFailure` and similar functions, we need a full blown derivation and not a lazy one.
This is an internal option for test framework developers.
We test that the test framework timeouts are working as expected.
From now on, we will aim to ensure that the test driver
gets tested by OfBorg using all our available tests.

This commit adds the driver timeout test to the driver.
@RaitoBezarius RaitoBezarius merged commit 92fdbd2 into NixOS:master Oct 29, 2023
@RaitoBezarius RaitoBezarius deleted the qemu-vm/timeout branch October 29, 2023 16:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

6.topic: nixos Issues or PRs affecting NixOS modules, or package usability issues specific to NixOS 10.rebuild-darwin: 0 This PR does not cause any packages to rebuild on Darwin. 10.rebuild-linux: 1-10 This PR causes between 1 and 10 packages to rebuild on Linux.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants