Race conditions when running integration tests with cargo-nextest #445

Closed · mati865 opened this issue Feb 19, 2025 · 5 comments · Fixed by #449
@mati865 (Collaborator) commented Feb 19, 2025

cargo-nextest has been my go-to test runner for a long time, but using it with this project often results in race conditions leading to spurious failures with errors like:

  • Error: Undefined symbol exit_syscall, referenced by /home/mateusz/Projects/wild/wild/tests/build/non-alloc.default-host-4cc03bc82a7d4860.o
  • ld.lld: error: /home/mateusz/Projects/wild/wild/tests/build/exit.default-host-2b99faac1607d4db.o: unknown file type
  • /home/mateusz/Projects/wild/wild/tests/build/exit.default-host-2b99faac1607d4db.o: file not recognized: file format not recognized
  • ld.lld: error: cannot open /home/mateusz/Projects/wild/wild/tests/build/exit.default-host-2b99faac1607d4db.o: No such file or directory
  • ld.lld: error: undefined symbol: exit_syscall >>> referenced by /home/mateusz/Projects/wild/wild/tests/build/stack_alignment.default-host-1051da6f98d46ec5.o:(.text+0x11)
  • and so on...

The spurious failures can happen with any linker: ld, lld, wild. The problem gets worse with more threads; with the default 32 threads (Ryzen 5950X), running rm -rf wild/tests/build/*; cargo nextest run integration_test is basically guaranteed to fail, but I've also seen failures with -j4.

I believe this is an issue with wild's tests, but I haven't figured out how to debug it yet.

Why cargo-nextest?

It makes searching for the failures much easier when multiple tests fail; for example, compare the output from program_name_31___cpp_integration_cc__ in https://gist.github.com/mati865/52e2dc2ac8f0e5c8c9b117a95642cf67#file-gistfile1-txt vs https://gist.github.com/mati865/52e2dc2ac8f0e5c8c9b117a95642cf67#file-gistfile2-txt.
With cargo t, wild's error ends up in a totally different place; nextest doesn't have this issue.

@davidlattimore (Owner)

I've also been using cargo-nextest for a while. It's worked reasonably well for me, but I only have 8 cores on my laptop. I just tried running with -j32 and I'm able to reproduce similar failures. Looking into it now... my approach with this kind of thing is to strace the build, then try to make sense of what happened from that. We'll see what I find.

@davidlattimore (Owner)

Ah, I think I might understand what's going on. Our integration tests have a mutex that they hold while checking whether to create a file, then subsequently creating it. However, it looks like nextest runs the tests from multiple separate processes, not from multiple separate threads. The only option I can think of to fix this would be to use file locking instead of mutexes.
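
A simplified sketch of the pattern described above (the names are hypothetical, not wild's actual test code): a process-local mutex serialises the check-then-create step only for threads within one test process, so nextest's separate worker processes can still race.

```rust
use std::path::Path;
use std::sync::Mutex;

// Process-local lock: it cannot coordinate separate nextest worker processes.
static BUILD_LOCK: Mutex<()> = Mutex::new(());

fn ensure_built(output: &Path, build: impl FnOnce()) {
    let _guard = BUILD_LOCK.lock().unwrap();
    if !output.exists() {
        // Another *process* can reach this point at the same time and observe
        // (or clobber) a partially written object file.
        build();
    }
}
```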

@davidlattimore (Owner)

I had a quick look and file-guard looks like a good option for file locking. There's also file-lock, but it sees less usage and also looks like it depends on building some C code. I'll leave this for you if you'd like to look at it; if you'd rather not, I'll do it or open it up for someone else.

@mati865 (Collaborator, Author) commented Feb 19, 2025

Thanks for the analysis and good catch!
There is an fs2 crate that implements locking and is written in pure Rust.
It works well in a commercial system at my work, so I imagine it should work for wild as well.
Another option I have heard about but haven't tried is fd-lock; it's also pure Rust.

mati865 self-assigned this Feb 19, 2025
@mati865 (Collaborator, Author) commented Feb 19, 2025

I have run the benchmarks and the fd-lock crate is virtually as fast as the previous mutex: https://gist.github.com/mati865/bed159ec46a60799d2a013cab6670eb5

Explanation of the scenarios:

  • 01_clean.md - rm -rf wild/tests/build* was performed before each benchmark pass
  • 02_touch.md - touch wild/tests/sources/libc* wild/tests/sources/tls* was performed before each benchmark pass
  • 03_hot.md - no changes

The feature variants:

  • no_lock - no locking of any kind
  • mutex - the original implementation with a mutex
  • rw_lock - a shared (read) lock for the reading part and an exclusive (write) lock for the writing part
  • w_lock - an exclusive (write) lock around the whole section (a drop-in replacement for the current mutex)

Given these results, I'll go with the w_lock implementation.
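
For reference, a minimal sketch of what the w_lock variant could look like, assuming the fd-lock crate's fd_lock::RwLock API; the helper name and the side-car lock-file naming are illustrative, not wild's actual test code:

```rust
use std::fs::OpenOptions;
use std::io;
use std::path::Path;

use fd_lock::RwLock;

// Serialise the check-then-build step across processes by holding an
// exclusive lock on a side-car lock file for the whole section, mirroring the
// old mutex-guarded critical section.
fn with_build_lock<R>(output: &Path, f: impl FnOnce() -> io::Result<R>) -> io::Result<R> {
    let lock_path = output.with_extension("lock");
    let lock_file = OpenOptions::new()
        .create(true)
        .write(true)
        .open(&lock_path)?;
    let mut lock = RwLock::new(lock_file);

    // Exclusive (write) lock for the whole critical section.
    let _guard = lock.write()?;
    f()
    // The guard drops when this function returns, releasing the advisory lock
    // so other test processes can proceed.
}
```

A caller would perform the existence check and the build inside the closure, so both happen under the same cross-process lock.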
