ParityDb migration tests are flaky #7609
Comments
Yeah, we may need to comment those out if they impede the development workflow.
Trying to reproduce, but unsuccessful so far. What command did you run? I'm doing `while cargo ltest test_paritydb_migrate_2_to_3; do :; done`
OK, I get it if I just do
running `while true; do cargo test || break; done` in
I'm flummoxed; I don't see how these tests can be related to #7337. The tests and the error seem quite specific to paritydb.
If I understand correctly, this only happens when there are multiple tests running? The error message here means that the database is already open. This could happen if multiple tests use the same temp dir. In any case, these upgrade tests are probably obsolete. I'd check with the polkadot team whether it even makes sense to still maintain the upgrade code.
I've confirmed locally that the issue was introduced by #7337. Weird.
Indeed, the test can be removed now.
It is weird. Since the branch is still available, it would be possible to go through the entire history and check which commit caused the regression, but if the tests can be removed, I'll create a PR.
Yeah, I'll go bisect the branch, partly out of morbid curiosity and partly because I'm afraid there is a deeper issue that could reappear later.
The test failures were introduced in dc5d6a0, where we added some other tests in the same crate. The new tests could not run at the same time as each other, so I had added some code like this:

```rust
// Ensure tests run one at a time so they don't interfere with each other.
static OVERRIDES_LOCK: Mutex<()> = Mutex::new(());

let _lock = OVERRIDES_LOCK.lock()?;
```

In the next commit this was replaced with the

My guess is that this affected the order the tests run in, in such a way that the paritydb tests step on each other's toes.

Question: is this expected behavior in paritydb, or a bug that could manifest in production? I guess in production there is only ever one instance of paritydb? Maybe we should at least catch this error and throw a more descriptive one for the benefit of future devs? @ordian
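For context, a minimal self-contained sketch of that test-serialization pattern; the test name and body here are illustrative, not the actual polkadot test code, and recovering the guard on poison is just one way to keep a panicking test from blocking the rest:

```rust
use std::sync::Mutex;

// Shared lock used to force the tests in this module to run one at a time.
static OVERRIDES_LOCK: Mutex<()> = Mutex::new(());

#[test]
fn test_touching_shared_state() {
    // Hold the lock for the duration of the test; if another test panicked
    // while holding it, recover the guard instead of propagating the poison.
    let _lock = OVERRIDES_LOCK.lock().unwrap_or_else(|e| e.into_inner());

    // ... test body that touches process-global state ...
}
```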
Multiple database instances should not interfere unless they use the same path. Each test must use a unique temp dir. No
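To illustrate the unique-temp-dir point, a minimal sketch assuming the `tempfile` crate; the test name and body are hypothetical, not the actual migration tests:

```rust
#[test]
fn paritydb_migration_smoke_test() {
    // Each test gets its own directory, so concurrently running tests can
    // never point two database instances at the same path.
    let dir = tempfile::tempdir().expect("create unique temp dir");
    let db_path = dir.path().join("db");
    assert!(!db_path.exists());

    // ... open the database at `db_path` and run the migration under test ...

    // When `dir` goes out of scope, the directory (including any lock file
    // inside it) is deleted.
}
```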
@arkpar If I understand correctly, the paritydb tests which have now been removed were simply not written correctly, so there's no actual bug there? BTW the hacks were for tests related to #7337 (totally unrelated to paritydb), and on further inspection they do actually need to be there; using a unique temp dir is not enough.
I don't see anything wrong with the tests. They should have worked. I've looked at the

I've tried reproducing the issue with
Someone sent me this log output from a parachain DB migration:

There is a question in the General Element chat from Peter W. about it. Could this be related?
@ggwpez No idea if there is a connection, but IMO it should be investigated. Reopening.
I don't think so. This is a RocksDB error. If this can be reproduced, please file a separate issue.
Added an explicit DB unlock in paritytech/parity-db#218, which resolves this. Normally, file locks should be removed automatically once the file handle is closed. For some reason, when running tests, the lock may take some time to disappear on its own.
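For readers unfamiliar with how the "database already open" symptom arises, a rough sketch of advisory file locking using the `fs2` crate; parity-db's actual locking code may differ in its details:

```rust
use fs2::FileExt; // assumption: the `fs2` crate for cross-platform advisory file locks
use std::fs::File;

fn main() -> std::io::Result<()> {
    let lock_file = File::create("example.lock")?;

    // Fails with an error if another process currently holds the lock,
    // which is the "database already open" situation described above.
    lock_file.try_lock_exclusive()?;

    // ... work with the database ...

    // Explicitly release the lock instead of relying on the handle being
    // closed, analogous to the fix referenced above.
    lock_file.unlock()?;
    Ok(())
}
```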
PR #7601 has now failed twice on the `chains_db::upgrade::tests::test_paritydb_migrate_2_to_3` test. The issue is also reproducible locally. If the tests are run in a loop, both `test_paritydb_migrate_2_to_3` and `test_paritydb_migrate_1_to_2` fail, but not consistently.

The error that is printed:

These test failures are also observable on master. Bisecting revealed #7337 to be the PR that introduced the regression. cc @mrcnski @s0me0ne-unkn0wn