Text file busy flake in tests #442

Closed
gakonst opened this issue Jan 13, 2022 · 15 comments · Fixed by #709
Labels
D-chore (Difficulty: chore), T-bug (Type: bug)

Comments

gakonst commented Jan 13, 2022

For some reason, tests fail sometimes with the error below:

---- multi_runner::tests::sputnik::test_sputnik_abstract_contract stdout ----
compiling...
thread 'multi_runner::tests::sputnik::test_sputnik_abstract_contract' panicked at 'called `Result::unwrap()` on an `Err` value: Text file busy (os error 26)

Other times it fails with this one:

---- multi_runner::tests::sputnik::test_sputnik_abstract_contract stdout ----
compiling...
thread 'multi_runner::tests::sputnik::test_sputnik_abstract_contract' panicked at 'called `Result::unwrap()` on an `Err` value: "/home/runner/.svm/0.8.1/solc-0.8.1": Text file busy (os error 26)

I suspect this might be related to the Solc downloader trying to download a binary multiple times and writing it over an existing file? Not sure.

gakonst added the T-bug (Type: bug) label on Jan 13, 2022

onbjerg commented Jan 17, 2022

Going by the backtrace:

---- multi_runner::tests::sputnik::test_sputnik_abstract_contract stdout ----
compiling...
thread 'multi_runner::tests::sputnik::test_sputnik_abstract_contract' panicked at 'called `Result::unwrap()` on an `Err` value: Text file busy (os error 26)

Location:
    /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/core/src/result.rs:1915:27', forge/src/multi_runner.rs:223:82
stack backtrace:
   0: rust_begin_unwind
             at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/std/src/panicking.rs:517:5
   1: core::panicking::panic_fmt
             at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/core/src/panicking.rs:101:14
   2: core::result::unwrap_failed
             at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/core/src/result.rs:1617:5
   3: core::result::Result<T,E>::unwrap
             at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/core/src/result.rs:1299:23
   4: forge::multi_runner::tests::runner
             at ./src/multi_runner.rs:223:9
   5: forge::multi_runner::tests::test_abstract_contract
             at ./src/multi_runner.rs:251:26
   6: forge::multi_runner::tests::sputnik::test_sputnik_abstract_contract
             at ./src/multi_runner.rs:301:13
   7: forge::multi_runner::tests::sputnik::test_sputnik_abstract_contract::{{closure}}
             at ./src/multi_runner.rs:300:9
   8: core::ops::function::FnOnce::call_once
             at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/core/src/ops/function.rs:227:5
   9: core::ops::function::FnOnce::call_once
             at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/core/src/ops/function.rs:227:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.


failures:
    multi_runner::tests::sputnik::test_sputnik_abstract_contract

We end up in MultiContractRunnerBuilder::build, which can only return an Err in one place: when it calls project.compile()?. Since we're running with --all-features, we end up in Project::svm_compile for the next Result.

My best bet is that either we are building the same contracts in two tests, so the cache/artifacts are being written at the same time, or it is SVM, as you suspected.

I don't see a super elegant solution; maybe using serial_test or running with -- --test-threads=1 (slow!). A sketch of the serial_test option follows below.
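For reference, a minimal sketch of the serial_test option mentioned above (the test name and body are placeholders, not the actual forge test): the #[serial] attribute makes annotated tests run one at a time while the rest of the suite stays parallel.

```rust
// Hypothetical illustration of the serial_test idea; not the real forge test.
use serial_test::serial;

#[test]
#[serial] // tests marked #[serial] never run concurrently with each other
fn test_sputnik_abstract_contract() {
    // ... compile the project and run the abstract-contract test as usual ...
}
```

Note that this only serializes tests inside a single test binary, which is part of why the discussion below moves toward a filesystem lock.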

gakonst commented Jan 17, 2022

> My best bet is that either we are building the same contracts in two tests, so the cache/artifacts are being written at the same time, or it is SVM, as you suspected.

@mattsse this seems to align with how we've been thinking about the caching issue?

I think there should be a "right" way to solve it; serial tests would be a big no-no, so let's try to brainstorm it a bit?

mattsse commented Jan 17, 2022

so "text file busy" happens if a program tries to open a file for writing while it is currently being executed (or tries to execute a file that is currently open for writing).

this could be a race condition within our svm abstraction.
is there a way to reproduce this?
RUST_LOG=ethers=trace should also be helpful to locate the root cause
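For context, a minimal way to see the error message itself on Linux, independent of forge or svm: opening a file that is currently being executed with write access fails with ETXTBSY.

```rust
use std::fs::OpenOptions;

fn main() {
    // On Linux, /proc/self/exe points at the binary this process is executing,
    // so opening it for writing fails with "Text file busy" (os error 26);
    // this is the same error the flaky tests hit when a solc binary is
    // overwritten while another process is running it.
    let err = OpenOptions::new()
        .write(true)
        .open("/proc/self/exe")
        .expect_err("writing to a running executable should fail");
    println!("{err}");
}
```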

onbjerg commented Jan 17, 2022

To reproduce: run cargo test --all --all-features -- --test-threads=<high number>, where <high number> is the number of threads you can use before your computer flies off into space. For good measure, also delete or move $HOME/.svm before running. Run until the test fails. 🤷

I used a thread count of 32 and ran it 2 or 3 times before I encountered the failing test

mattsse commented Jan 17, 2022

that makes a bit more sense now,
this happens only during tests, right?

mattsse commented Jan 17, 2022

my guess is, two tests run in parallel, both need solc x.y.z which is missing, so they both try to install it...
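A toy sketch of that suspected race, with illustrative names and paths (not the actual svm API): both tests can pass the "is it installed?" check before either finishes writing, so one install can end up overwriting a binary that another test is already executing.

```rust
use std::{fs, path::Path, thread};

// Illustrative check-then-install race: both threads can observe the binary
// as missing and both proceed to write it. In forge's case, one test may
// already be executing the solc binary while another rewrites it, which is
// what surfaces as "Text file busy" on Linux.
fn ensure_installed(path: &Path) {
    if !path.exists() {
        fs::write(path, b"pretend this is the solc binary").unwrap();
    }
}

fn main() {
    let path = std::env::temp_dir().join("solc-0.8.1");
    let a = thread::spawn({
        let p = path.clone();
        move || ensure_installed(&p)
    });
    let b = thread::spawn({
        let p = path.clone();
        move || ensure_installed(&p)
    });
    a.join().unwrap();
    b.join().unwrap();
}
```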

gakonst commented Jan 17, 2022

Yeah, that's typically what happened in the past, which I thought we had solved with a mutex in the installation step. Maybe the mutex got lost during our ethers-solc refactors?

mattsse commented Jan 17, 2022

the lock in the tests is still there, but perhaps this is an integration-test issue where multiple test binaries are executed

maybe we can try using a lockfile instead: https://crates.io/crates/fslock ?
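Roughly what the lockfile idea could look like; the fslock method names are as I recall them from its docs, and the lock path and function name are assumptions, so treat this as a sketch rather than a patch.

```rust
use fslock::LockFile;

// Sketch: serialize solc installation across processes with an advisory
// file lock next to the svm directory. Lock path and function name are
// illustrative, not the real ethers-solc/svm code.
fn install_solc_locked(version: &str) {
    let home = std::env::var("HOME").expect("HOME not set");
    let lock_path = format!("{home}/.svm/.install-lock");

    let mut lock = LockFile::open(lock_path.as_str()).expect("failed to open lock file");
    lock.lock().expect("failed to take the install lock"); // blocks until other installers finish

    // ... download and write the solc binary for `version` here ...
    println!("installing solc {version} while holding the lock");

    lock.unlock().expect("failed to release the install lock");
}
```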

mattsse commented Jan 17, 2022

Oh I think I've found it,

I changed version detection in the new graph implementation

onbjerg commented Jan 18, 2022

I don't think an in-process mutex solves it; I think we do need a lockfile. AFAIK tests are built as a special executable that Cargo just runs in parallel or something :/

gakonst commented Jan 18, 2022

I think adding a lockfile under ~/.svm and the artifacts dir could be a fine solution if we don't have a better one, as long as it gets cleaned up instantly and we document it nicely (it's never cool when the user needs to manually remove index/lockfiles)

onbjerg commented Jan 18, 2022

Ok, I looked at it to be sure: it does not run the executable multiple times side by side. Perhaps it's just that we call install methods other than Solc::ensure_installed, so the lock should actually be in Solc::install and Solc::blocking_install.

gakonst/ethers-rs#808

gakonst commented Feb 2, 2022

From latest CI:
(screenshot of the failing CI run omitted)

The println makes me believe the issue is during the svm installation step here, which proceeds to call svm::install.

I think the offending code is either the std::io::copy or the std::fs::File::create above it

@mattsse mentioned that if gakonst/ethers-rs#802 doesn't address this, we should consider using a lockfile, given that our existing mutexes in ethers-solc don't seem to be getting hit.
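For what it's worth, one common way to avoid ETXTBSY when replacing a binary (not necessarily what the linked PRs do) is to write the download to a temporary file and rename it over the target: the rename swaps the directory entry without touching the inode a running process is executing. A hedged sketch:

```rust
use std::{fs, io, path::Path};

// Sketch only, not the ethers-rs code: write the downloaded bytes to a
// sibling temp file, mark it executable, then atomically rename it over the
// final path. A process still executing the old binary keeps the old inode,
// so no "Text file busy" is triggered.
fn write_binary_atomically(target: &Path, bytes: &[u8]) -> io::Result<()> {
    let tmp = target.with_extension("tmp");
    fs::write(&tmp, bytes)?;
    #[cfg(unix)]
    {
        use std::os::unix::fs::PermissionsExt;
        fs::set_permissions(&tmp, fs::Permissions::from_mode(0o755))?;
    }
    fs::rename(&tmp, target)?; // atomic when tmp and target are on the same filesystem
    Ok(())
}
```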

roynalnaruto commented:

@gakonst @mattsse I added a PR above that will block the installation until the lock file is released (up to a timeout of 10 seconds).
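Roughly the shape of that behavior, as an illustration only; this is not the code from the PR, and the fslock method names are an assumption: poll the lock until it is acquired or the 10-second timeout elapses.

```rust
use std::{thread, time::{Duration, Instant}};

use fslock::LockFile;

// Sketch of "block on the lock file, but give up after a timeout".
fn acquire_install_lock(path: &str, timeout: Duration) -> Option<LockFile> {
    let start = Instant::now();
    let mut lock = LockFile::open(path).ok()?;
    // try_lock() reports false while another process holds the lock.
    while !lock.try_lock().ok()? {
        if start.elapsed() > timeout {
            return None; // caller reports a timeout error
        }
        thread::sleep(Duration::from_millis(100));
    }
    Some(lock) // the lock is released when the returned LockFile is dropped
}
```

A caller would do something like acquire_install_lock("/home/runner/.svm/.install-lock", Duration::from_secs(10)) before installing, then keep the returned handle alive until the binary is fully written.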

gakonst commented Feb 9, 2022

Let's see if this finally fixes it...
