Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Meson test fixups #3120

Merged
merged 5 commits into from
Dec 17, 2022
Merged

Meson test fixups #3120

merged 5 commits into from
Dec 17, 2022

Conversation

eli-schwartz
Copy link
Contributor

@eli-schwartz
Copy link
Contributor Author

After manually installing these fixes into the Meson WrapDB, I finally got both Windows and macOS (new additions to our test matrix) to pass CI: https://github.com/mesonbuild/wrapdb/runs/6089540648?check_suite_focus=true

@@ -183,6 +185,8 @@ test('test-zstream-1',
test('test-zstream-3',
zstreamtest,
args: ['--newapi', '-t1', ZSTREAM_TESTTIME] + FUZZER_FLAGS,
# --newapi dies on Windows with "exit status 3221225477 or signal 3221225349 SIGinvalid"
should_fail: host_machine_os == os_windows,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if the issue is fixed, and zstreamtest --newapi now completes successfully ?
would it be considered a test failure due to should_fail ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR changes its status to EXPECTEDFAIL, which is a success status. When the bug is fixed, it will start producing an UNEXPECTEDPASS status, which is considered a failure and causes the overall test results to be negative.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR changes its status to EXPECTEDFAIL, which is a success status. When the bug is fixed, it will start producing an UNEXPECTEDPASS status, which is considered a failure and causes the overall test results to be negative.

Better fix the issue then

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, once that other PR is merged I will re-check this and drop the obsolete commit.

@eli-schwartz
Copy link
Contributor Author

Is there anything else I should do about this?

As far as discussion over the second commit goes, it seems the attempt to actually fix the issue isn't getting anywhere. I think it's fine to add a "known broken" label to the testsuite.

@Cyan4973
Copy link
Contributor

I don't remember all the details,
it seems that :

  • On Windows + MinGW, and likely Windows + Visual, zstreamtest --newpi fails sometimes. But it's not reliably reproducible, meaning it fails at different places between consecutive runs on the same exact test.
  • This seems to points at a race condition in the multi-threading code. And it's possibly related to the shim pthread translation layer for Windows, since there is no such issue when using pthread directly, as in Windows + Cygwin combination for example, which works fine.
  • The proposal in this PR is to hide the issue by disabling this test, or making its result non-blocking, when run on Windows.

I think it's fair to say that I would largely prefer to have the issue fixed.
Problem is, the fix is not yet found, and it's a hard problem because the issue is very difficult to observe.
And a previous fix attempt, in #3121, failed to reach the objective.

Given a choice, I would still prefer to fix the issue.
Problem is, finding the time to look into it is difficult, our team is tied by other priorities.
We can still make this item a priority objective for the next release, but it doesn't tell when someone will take care of it.
It just imply that it's probably going to happen this year.

@eli-schwartz
Copy link
Contributor Author

The proposal in this PR is to hide the issue by disabling this test, or making its result non-blocking, when run on Windows.

To be clear, the proposal is to make it be non-blocking, but visibly reported:

passing tests, yay : XXX
failing tests, phew: 0
known broken? :/   : 1

This is something that the Test Anything protocol calls "todo" tests: https://testanything.org/tap-specification.html#todo-tests

These tests represent a feature to be implemented or a bug to be fixed and act as something of an executable “things to do” list.

Automake's "Generalities about testing" mentions:

It’s not uncommon, especially during early development stages, that some tests fail for known reasons, and that the developer doesn’t want to tackle these failures immediately (this is especially true when the failing tests deal with corner cases).

It's not uncommon to make a test failure be non-blocking when you've temporarily exhausted your efforts at solving it (and you have a tracking issue for it, and the testsuite flags it as a FIXME). If the test failure doesn't indicate a critical release blocker, then you want the testsuite status to be useful -- i.e. it should be non-blocking by default, and only raise critical errors if new or unknown issues are introduced.

@Cyan4973
Copy link
Contributor

OK, this will be discussed with the team, so that we can provide a feedback on this PR in a reasonable delay.

@eli-schwartz
Copy link
Contributor Author

Alright, thanks for looking into it. :)

@Cyan4973
Copy link
Contributor

Follow up : @terrelln is going to have a look at this topic.
It requires a working Windows machine, with development tools, so it's gonna take a little time, but hopefully nothing too long.
The hope is that a fix can be found quickly, on the ground that the problem can be reproduced pretty easily.
But if not, we'll revisit this proposal of having a non-blocking test while waiting for a (longer term) fix.

@eli-schwartz
Copy link
Contributor Author

Any update on this? :)

@eli-schwartz
Copy link
Contributor Author

@terrelln did you ever have a chance to investigate this?

@yoniko
Copy link
Contributor

yoniko commented Dec 14, 2022

@terrelln did you ever have a chance to investigate this?

I'll take a look at this today / tomorrow.

@yoniko
Copy link
Contributor

yoniko commented Dec 15, 2022

An update: I've reproduced the problem and found the root cause.
There's indeed a bug in the windows threading translation layer.
Will put up a PR with a fix soon.

@eli-schwartz
Copy link
Contributor Author

Fantastic! Thank you for your investigation. Looking forward to being able to pass tests on the next zstd release. :)

yoniko added a commit to yoniko/zstd that referenced this pull request Dec 16, 2022
1. If threads are resized the threads' `ZSTD_pthread_t` might move while the worker still holds a pointer into it (see more details in facebook#3120).
2. The join operation was waiting for a thread and then return its `thread.arg` as a return value, but since the `ZSTD_pthread_t thread` was passed by value it would have a stale `arg` that wouldn't match the thread's actual return value.

This fix changes the `ZSTD_pthread_join` API and removes support for returning a value. This means that we are diverging from the `pthread_join` API and this is no longer just an alias.
In the future, if needed, we could return a Windows thread's return value using `GetExitCodeThread`, but as this path wouldn't be excised in any case, it's preferable to not add it right now.
eli-schwartz pushed a commit to eli-schwartz/zstd that referenced this pull request Dec 16, 2022
1. If threads are resized the threads' `ZSTD_pthread_t` might move while the worker still holds a pointer into it (see more details in facebook#3120).
2. The join operation was waiting for a thread and then return its `thread.arg` as a return value, but since the `ZSTD_pthread_t thread` was passed by value it would have a stale `arg` that wouldn't match the thread's actual return value.

This fix changes the `ZSTD_pthread_join` API and removes support for returning a value. This means that we are diverging from the `pthread_join` API and this is no longer just an alias.
In the future, if needed, we could return a Windows thread's return value using `GetExitCodeThread`, but as this path wouldn't be excised in any case, it's preferable to not add it right now.
yoniko added a commit to yoniko/zstd that referenced this pull request Dec 16, 2022
1. If threads are resized the threads' `ZSTD_pthread_t` might move
while the worker still holds a pointer into it (see more details in facebook#3120).
2. The join operation was waiting for a thread and then return its `thread.arg`
as a return value, but since the `ZSTD_pthread_t thread` was passed by value it
would have a stale `arg` that wouldn't match the thread's actual return value.

This fix changes the `ZSTD_pthread_join` API and removes support for returning
a value. This means that we are diverging from the `pthread_join` API and this
is no longer just an alias.
In the future, if needed, we could return a Windows thread's return value using
`GetExitCodeThread`, but as this path wouldn't be excised in any case, it's
preferable to not add it right now.
@eli-schwartz
Copy link
Contributor Author

Added CI improvements on top. This also includes #3368 which is needed to configure the contrib directory so that CI runs, although that PR is much more easily merged.

Copy link
Contributor

@yoniko yoniko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, my only feedback is that I think it'd be better to pin the third-party action to a specific commit.
After making the change I think it'd be a good idea to merge this and then merge a PR that fixes the issue and removes the expected failure for windows.

.github/workflows/dev-short-tests.yml Outdated Show resolved Hide resolved
yoniko added a commit to yoniko/zstd that referenced this pull request Dec 16, 2022
1. If threads are resized the threads' `ZSTD_pthread_t` might move
while the worker still holds a pointer into it (see more details in facebook#3120).
2. The join operation was waiting for a thread and then return its `thread.arg`
as a return value, but since the `ZSTD_pthread_t thread` was passed by value it
would have a stale `arg` that wouldn't match the thread's actual return value.

This fix changes the `ZSTD_pthread_join` API and removes support for returning
a value. This means that we are diverging from the `pthread_join` API and this
is no longer just an alias.
In the future, if needed, we could return a Windows thread's return value using
`GetExitCodeThread`, but as this path wouldn't be excised in any case, it's
preferable to not add it right now.
yoniko added a commit to yoniko/zstd that referenced this pull request Dec 16, 2022
1. If threads are resized the threads' `ZSTD_pthread_t` might move
while the worker still holds a pointer into it (see more details in facebook#3120).
2. The join operation was waiting for a thread and then return its `thread.arg`
as a return value, but since the `ZSTD_pthread_t thread` was passed by value it
would have a stale `arg` that wouldn't match the thread's actual return value.

This fix changes the `ZSTD_pthread_join` API and removes support for returning
a value. This means that we are diverging from the `pthread_join` API and this
is no longer just an alias.
In the future, if needed, we could return a Windows thread's return value using
`GetExitCodeThread`, but as this path wouldn't be excised in any case, it's
preferable to not add it right now.
It's entirely possible some people don't have valgrind installed, but
still want to run the tests. If they don't have it installed, then they
probably don't intend to run those precise test targets anyway.

Also, this solves an error when running the tests in an automated
environment. The valgrind tests have a hard dependency on behavior such
as `./zstd` erroring out with the message "stdin is a console, aborting"
which does not work if the automated environment doesn't have a console.
As a rough heuristic, automated environments lacking a console will
*probably* also not have valgrind, so avoiding that test definition
neatly sidesteps the issue.

Also, valgrind is not easily installable on macOS, at least homebrew
says it isn't available there. This makes it needlessly hard to
enable the testsuite on macOS.
playTests.sh has an option to run really slow tests. This is enabled by
default in Meson, but what we really want is to do like the Makefile,
and run the fast ones by default, but with an option to run the slow
ones instead.
Travis is no longer run, but this wasn't ported to something else.
There are a couple of oddities here. We don't attempt to build e.g.
contrib, because that doesn't seem to work at the moment. Also notice
that each command is its own step. This happens because github actions
runs in powershell, which doesn't seem to let you abort on the first
failure.
@yoniko
Copy link
Contributor

yoniko commented Dec 17, 2022

Looks good, although a bit unstable as the windows test might actually pass sometimes.
Merging so we can merge the other PR that solves the bug and would fix the CI to expect to pass.

@yoniko yoniko merged commit ce61cb8 into facebook:dev Dec 17, 2022
yoniko added a commit to yoniko/zstd that referenced this pull request Dec 17, 2022
1. If threads are resized the threads' `ZSTD_pthread_t` might move
while the worker still holds a pointer into it (see more details in facebook#3120).
2. The join operation was waiting for a thread and then return its `thread.arg`
as a return value, but since the `ZSTD_pthread_t thread` was passed by value it
would have a stale `arg` that wouldn't match the thread's actual return value.

This fix changes the `ZSTD_pthread_join` API and removes support for returning
a value. This means that we are diverging from the `pthread_join` API and this
is no longer just an alias.
In the future, if needed, we could return a Windows thread's return value using
`GetExitCodeThread`, but as this path wouldn't be excised in any case, it's
preferable to not add it right now.
yoniko added a commit to yoniko/zstd that referenced this pull request Dec 17, 2022
1. If threads are resized the threads' `ZSTD_pthread_t` might move
while the worker still holds a pointer into it (see more details in facebook#3120).
2. The join operation was waiting for a thread and then return its `thread.arg`
as a return value, but since the `ZSTD_pthread_t thread` was passed by value it
would have a stale `arg` that wouldn't match the thread's actual return value.

This fix changes the `ZSTD_pthread_join` API and removes support for returning
a value. This means that we are diverging from the `pthread_join` API and this
is no longer just an alias.
In the future, if needed, we could return a Windows thread's return value using
`GetExitCodeThread`, but as this path wouldn't be excised in any case, it's
preferable to not add it right now.
@eli-schwartz eli-schwartz deleted the meson-fixup branch December 17, 2022 23:23
@eli-schwartz
Copy link
Contributor Author

Oops, yeah, I think I grabbed a hash for the msvc-dev-cmd action, from the wrong branch.

@eli-schwartz
Copy link
Contributor Author

Honestly, I really wish it was possible to target semantically meaningful versions of an action, like I originally did, and specify a checksum to verify, instead of abusing commit hashes. But GitHub doesn't support that. :(

@Cyan4973 Cyan4973 mentioned this pull request Feb 9, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants