Race conditions when running integration tests with cargo-nextest #445

Closed · mati865 opened this issue Feb 19, 2025 · 5 comments · Fixed by #449
@mati865 (Collaborator) commented Feb 19, 2025

cargo-nextest has been my go-to test runner for a long time, but using it with this project often results in race conditions leading to spurious failures with errors like:

  • Error: Undefined symbol exit_syscall, referenced by /home/mateusz/Projects/wild/wild/tests/build/non-alloc.default-host-4cc03bc82a7d4860.o
  • ld.lld: error: /home/mateusz/Projects/wild/wild/tests/build/exit.default-host-2b99faac1607d4db.o: unknown file type
  • /home/mateusz/Projects/wild/wild/tests/build/exit.default-host-2b99faac1607d4db.o: file not recognized: file format not recognized
  • ld.lld: error: cannot open /home/mateusz/Projects/wild/wild/tests/build/exit.default-host-2b99faac1607d4db.o: No such file or directory
  • ld.lld: error: undefined symbol: exit_syscall >>> referenced by /home/mateusz/Projects/wild/wild/tests/build/stack_alignment.default-host-1051da6f98d46ec5.o:(.text+0x11)
  • and so on...

The spurious failures can happen with any linker: ld, lld, wild. The problem gets worse with more threads; with the default 32 threads (Ryzen 5950X), running rm -rf wild/tests/build/*; cargo nextest run integration_test is basically guaranteed to fail, but I've also seen failures with -j4.

I believe this is an issue with wild's tests, but I haven't figured out how to debug it yet.

Why cargo-nextest?

It makes searching for the failures much easier when multiple tests fail; for example, compare the output from program_name_31___cpp_integration_cc__ in https://gist.github.com/mati865/52e2dc2ac8f0e5c8c9b117a95642cf67#file-gistfile1-txt vs https://gist.github.com/mati865/52e2dc2ac8f0e5c8c9b117a95642cf67#file-gistfile2-txt.
With cargo t, wild's error ends up in a totally different place; nextest doesn't have this issue.

@davidlattimore (Owner)

I've also been using cargo-nextest for a while. It's worked reasonably well for me, but I only have 8 cores on my laptop. I just tried running with -j32 and I'm able to reproduce similar failures. Looking into it now... my approach with this kind of thing is to strace the build, then try to make sense of what happened from that. We'll see what I find.

@davidlattimore (Owner)

Ah, I think I might understand what's going on. Our integration tests have a mutex that they hold while checking whether to create a file, then subsequently creating it. However, it looks like nextest runs the tests from multiple separate processes, not from multiple separate threads. The only option I can think of to fix this would be to use file locking instead of mutexes.
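
A simplified sketch of the pattern described above (the names are hypothetical, not wild's actual test code): a process-local mutex serialises the check-then-create step only for threads within one test process, so nextest's separate worker processes can still race.

```rust
use std::path::Path;
use std::sync::Mutex;

// Process-local lock: it cannot coordinate separate nextest worker processes.
static BUILD_LOCK: Mutex<()> = Mutex::new(());

fn ensure_built(output: &Path, build: impl FnOnce()) {
    let _guard = BUILD_LOCK.lock().unwrap();
    if !output.exists() {
        // Another *process* can reach this point at the same time and observe
        // (or clobber) a partially written object file.
        build();
    }
}
```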

@davidlattimore (Owner)

I had a quick look and file-guard looks like a good option for file locking. There's also file-lock, but it sees less usage and also looks like it depends on building some C code. I'll leave this for you if you'd like to look at it; if you'd rather not, I'll do it or open it up for someone else.

@mati865 (Collaborator, Author) commented Feb 19, 2025

Thanks for the analysis and good catch!
There is an fs2 crate that implements locking and is written in pure Rust.
It works well in a commercial system at my work, so I imagine it should work for wild as well.
Another option I have heard about but haven't tried is fd-lock; it's also pure Rust.

mati865 self-assigned this Feb 19, 2025
@mati865 (Collaborator, Author) commented Feb 19, 2025

I have run the benchmarks and the fd-lock crate is virtually as fast as the previous mutex: https://gist.github.com/mati865/bed159ec46a60799d2a013cab6670eb5

Explanation of the scenarios:

  • 01_clean.md - rm -rf wild/tests/build* was performed before each benchmark pass
  • 02_touch.md - touch wild/tests/sources/libc* wild/tests/sources/tls* was performed before each benchmark pass
  • 03_hot.md - no changes

The feature variants:

  • no_lock - no locking of any kind
  • mutex - the original implementation with a mutex
  • rw_lock - a shared (read) lock for the reading part and an exclusive (write) lock for the writing part
  • w_lock - an exclusive (write) lock around the whole section (a drop-in replacement for the current mutex)

Given these results, I'll go with the w_lock implementation.
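
For reference, a minimal sketch of what the w_lock variant could look like, assuming the fd-lock crate's fd_lock::RwLock API; the helper name and the side-car lock-file naming are illustrative, not wild's actual test code:

```rust
use std::fs::OpenOptions;
use std::io;
use std::path::Path;

use fd_lock::RwLock;

// Serialise the check-then-build step across processes by holding an
// exclusive lock on a side-car lock file for the whole section, mirroring the
// old mutex-guarded critical section.
fn with_build_lock<R>(output: &Path, f: impl FnOnce() -> io::Result<R>) -> io::Result<R> {
    let lock_path = output.with_extension("lock");
    let lock_file = OpenOptions::new()
        .create(true)
        .write(true)
        .open(&lock_path)?;
    let mut lock = RwLock::new(lock_file);

    // Exclusive (write) lock for the whole critical section.
    let _guard = lock.write()?;
    f()
    // The guard drops when this function returns, releasing the advisory lock
    // so other test processes can proceed.
}
```

A caller would perform the existence check and the build inside the closure, so both happen under the same cross-process lock.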
