Fix sqlite tests #2636
Conversation
|
@dmytro-pryvedeniuk I'm good with these changes for now. In retrospect, it really doesn't make any sense to even have the locking on sqlite, as you can't run it in a cluster anyway, right? @mysticmind, what do you think? I'm going to hold off on this until after the next release, though, just to get bug fixes out. |
|
@jeremydmiller I read https://wolverinefx.net/guide/durability/sqlite.html#sqlite-messaging-transport as "multiple processes can use the same DB, taking the SQLite single-writer limitation into account". So the locking is needed for migration at least (or not, if the lock is ignored after retries anyway). I'm not sure about other use cases (current or future), though. Doesn't this (https://github.com/JasperFx/wolverine/blob/e3caa7d614f3dd0ff01cbbf3c85bb831ea2d3bdd/src/Persistence/SqliteTests/message_store_initialization_and_configuration.cs) mean that multiple nodes are possible? |
|
@dmytro-pryvedeniuk I'm going to punt a bit and ask @mysticmind to review this as he has vastly more Sqlite experience than I do. Babu, thank you in advance! |
|
I haven't had a chance to look at this; I will do so in the coming week and get back.
The docs clearly state that this is for single-node usage. You can't scale it; it doesn't work for multi-node usage. Using a single process is the right usage for SQLite. |
|
Hi @dmytro-pryvedeniuk, really appreciated this PR. Your analysis pinned down several real bugs, and the test that motivated it. I've taken your work as the starting point and pushed an alternative approach in a separate PR (link to follow): fix(sqlite): use BEGIN EXCLUSIVE for migration lock. The key difference is on bug no. 2 (the `wolverine_locks` table being created by the very migration it is supposed to protect): instead of creating the table up front from the lock path, migration now uses SQLite's own `BEGIN EXCLUSIVE` transaction lock, which does not depend on any application table existing.

What I kept directly from your PR (with credit in the commit body):
- `SqliteAdvisoryLock.TryAttainLockAsync` made idempotent via `HasLock`
- the `SqliteAdvisoryLock.DisposeAsync` NRE fix (dropped the duplicate close+dispose)
- `SqliteMessageStore` polling delegating to `SqliteAdvisoryLock` (`TryAttainLockAsync` checks affected rows; `ReleaseLockAsync` actually deletes the row)
- dropping the racy "not yet delivered" assertion in the scheduled-tenant test

Tests: 4 focused migration-lock tests added, and `can_send_from_one_node_to_another_by_destination` is skipped on the SQLite local fixture (single host, no second node).

Open question Q4 (long-lived vs per-call connection in `SqliteAdvisoryLock`) is still open.

@jeremydmiller In retrospect, I am also thinking that the SQLite provider may not be used much in production, and the messaging layer does not come into play in really small apps. We may have to deprecate SQLite support altogether to reduce the time and effort spent maintaining this. |
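For reference, the bootstrap property that makes `BEGIN EXCLUSIVE` attractive can be shown against the SQLite engine directly. This is a minimal Python `sqlite3` sketch, not Wolverine code; it only demonstrates that the exclusive transaction lock exists before any application table (such as `wolverine_locks`) does:

```python
import os
import sqlite3
import tempfile

db = os.path.join(tempfile.mkdtemp(), "wolverine.db")

# Holder: BEGIN EXCLUSIVE takes SQLite's write lock immediately.
# Crucially, it needs no application table to exist, so it can
# protect the very migration that creates one.
holder = sqlite3.connect(db, isolation_level=None)
holder.execute("BEGIN EXCLUSIVE")

# Peer: any write attempt is rejected while the exclusive
# transaction is open, even a schema change.
peer = sqlite3.connect(db, isolation_level=None, timeout=0.1)
blocked = False
try:
    peer.execute("CREATE TABLE wolverine_locks (id INTEGER PRIMARY KEY)")
except sqlite3.OperationalError:  # "database is locked"
    blocked = True

holder.execute("COMMIT")  # closing the connection would release it too
peer.execute("CREATE TABLE wolverine_locks (id INTEGER PRIMARY KEY)")
print(blocked)  # → True
```

Because the lock lives in the transaction, SQLite releases it automatically if the holder's connection dies, which is exactly the stale-lock property a row-based scheme lacks.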
|
@mysticmind Your solution is better; EF Core also uses an exclusive lock for migration. What about non-migration locks? They can still be left stale, right? Regarding the single-node restriction: does it mean the same as "single process"? |
|
Good catches; you're right on both.

Stale non-migration locks: yes, the row-based `wolverine_locks` scheme can leave stale rows on a hard crash. The row isn't tied to the writing connection (unlike the new `BEGIN EXCLUSIVE` migration lock, which SQLite tears down with the connection). I just pushed a fix on fix/sqlite-migration-lock (ff0f3f8) that adds:
- a TTL sweep: rows older than the TTL are reaped on every attain attempt
- an implicit heartbeat: a live holder refreshes `acquired_at` on its row each time it re-attains on a poll tick

The default TTL is 2 min, comfortably above the 10 s heartbeat cadence. Dead holders unblock their peers within one TTL window of the next attempt.

"Single node" vs "single process": for the SQLite provider they're effectively the same. SQLite the engine permits multiple processes on one .db file, but Wolverine on top of it assumes one host per file. So: one Wolverine process per file. For multi-process or true multi-node setups, the docs steer you to Postgres / SQL Server. |
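For what it's worth, the sweep-plus-refresh scheme reads roughly like this. This is a Python `sqlite3` sketch, not the actual C# implementation; the `holder` column and the exact schema are assumptions for illustration (only the `wolverine_locks` table name and `acquired_at` column come from the thread):

```python
import sqlite3
import time

TTL_SECONDS = 120  # assumed default: the 2-minute TTL from the fix

conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute(
    """CREATE TABLE wolverine_locks (
           lock_id     INTEGER PRIMARY KEY,
           holder      TEXT NOT NULL,
           acquired_at REAL NOT NULL
       )""")

def try_attain(lock_id: int, holder: str) -> bool:
    now = time.time()
    # 1. Sweep: reap rows whose holder missed a whole TTL window.
    conn.execute("DELETE FROM wolverine_locks WHERE acquired_at < ?",
                 (now - TTL_SECONDS,))
    # 2. Implicit heartbeat: a live holder re-attaining on its poll
    #    tick refreshes acquired_at on its own row.
    cur = conn.execute(
        "UPDATE wolverine_locks SET acquired_at = ? "
        "WHERE lock_id = ? AND holder = ?",
        (now, lock_id, holder))
    if cur.rowcount == 1:
        return True
    # 3. Attain: only the affected-row count says whether we won.
    cur = conn.execute(
        "INSERT OR IGNORE INTO wolverine_locks VALUES (?, ?, ?)",
        (lock_id, holder, now))
    return cur.rowcount == 1

assert try_attain(1, "node-a")      # fresh lock
assert not try_attain(1, "node-b")  # held by a live peer
# Simulate node-a dying and a TTL window elapsing:
conn.execute("UPDATE wolverine_locks SET acquired_at = acquired_at - ?",
             (TTL_SECONDS + 1,))
assert try_attain(1, "node-b")      # stale row swept, lock re-attained
```

Because live holders re-attempt on every poll tick anyway, no separate heartbeat timer is needed; refreshing `acquired_at` on re-attain is the heartbeat.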
|
Nice, "one file, one process" is simpler and safer since writes are needed. Do you think we can make this clearer in the documentation? I see `NodeReassignmentPollingTime` in a sample, the dbcontrol queue, and the sqlite transport as a feature. All of this can make someone think that at least multiple processes per file are supported. |
|
Sure, will recheck the samples and sort it out. |
…tbeat for non-migration advisory locks (#2666)

* fix(sqlite): use BEGIN EXCLUSIVE for migration lock

  The row-based wolverine_locks scheme couldn't serialize migration: the table is created by the migration it's supposed to protect, so the first migration per tenant burned ~5.5s of failed lock retries before running unprotected. Migration now uses BEGIN EXCLUSIVE; polling keeps the row lock (by then the table exists).

  From #2636 (dmytro-pryvedeniuk):
  - SqliteAdvisoryLock.TryAttainLockAsync now idempotent via HasLock
  - SqliteAdvisoryLock.DisposeAsync NRE: dropped duplicate close+dispose
  - SqliteMessageStore polling overrides delegate to SqliteAdvisoryLock (TryAttain checks affected rows; Release actually deletes the row)
  - Dropped racy "not yet delivered" assertion in scheduled-tenant test

  Not picked: CreateLocksTableIfMissing in the lock hot path -- unneeded once migration uses BEGIN EXCLUSIVE.

  Also skips can_send_from_one_node_to_another_by_destination on the SQLite local fixture (single-host, no second node) and adds 4 focused migration-lock tests.

* fix(sqlite): TTL sweep + heartbeat for non-migration advisory locks

  wolverine_locks rows aren't bound to the writing connection, so a hard-killed holder leaves a row no peer reaps. Sweep stale rows on attempt; refresh acquired_at when the live holder re-attains. Live holders re-attempt on every poll tick (HealthCheck/ScheduledJob), so the heartbeat is implicit. Default TTL 2m.

  Split sqlite_migration_lock.cs: migration-lock tests stay; advisory-lock tests (idempotency, release, TTL/heartbeat) move to a new sqlite_advisory_lock.cs.
This PR fixes sqlite tests.
In `scheduled_messages_are_processed_in_tenant_files` we schedule messages with a 2 s delay, wait 300 ms, and check whether the messages have been received. It turns out that by the time the sending code gets to the messages, they are considered not "scheduled for a later time" but "to be sent immediately", as the scheduled time is before the current time. The simple fix would be to just get rid of this check or increase the schedule delay. For this test, the main interest should be that the messages are received by the correct tenants.
On the other hand, the sqlite advisory lock is created after several delayed retries to attain the migration lock. These attempts fail because the `wolverine_locks` table does not exist before migration. In the end, migration is executed without the lock, and the table is created only for the default tenant. As a fix, I moved table creation into `SqliteAdvisoryLock`; it's executed before an attempt to attain the lock (for each tenant).

Other fixed issues:
- `SqliteMessageStore.TryAttainLockAsync` wrongly assumes that if the SQL command executed, the lock was attained; the SQL is actually `INSERT OR IGNORE`, so the affected-record count must be checked. Fixed by delegating locking to the `SqliteAdvisoryLock` that is already in use. The downside is that the passed connection is ignored, as `SqliteAdvisoryLock` has its own connection. Can this be a problem?
- `SqliteMessageStore` does not release the lock acquired in `TryAttainLockAsync`, which means the migration lock remains in the table. Fixed by implementing `ReleaseLockAsync`. Again, the passed connection is ignored, as is the cancellation token.
- `SqliteAdvisoryLock.TryAttainLockAsync` is not idempotent: it returns `false` the second time even though `HasLock` is `true`. Fixed by calling `HasLock` from `TryAttainLockAsync`.
- `SqliteAdvisoryLock.DisposeAsync` throws a `NullReferenceException`, because `ReleaseLockAsync`, called just above, nulls the connection once no lock is held. Fixed by removing the failing code (the `finally` section already handles it).

All this unblocks the sqlite tests and makes them faster (~1 min instead of ~2.7 min).
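The `INSERT OR IGNORE` pitfall above can be reproduced at the engine level: the statement "succeeds" either way, and only the affected-row count reveals whether the lock row was actually inserted. A minimal Python `sqlite3` sketch (illustrative, not Wolverine code):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE wolverine_locks (id INTEGER PRIMARY KEY)")

def try_attain(lock_id: int) -> bool:
    # The statement never throws on a conflict; checking rowcount
    # (affected records) is the only way to know who won the row.
    cur = conn.execute(
        "INSERT OR IGNORE INTO wolverine_locks (id) VALUES (?)",
        (lock_id,))
    return cur.rowcount == 1

print(try_attain(42))  # → True: row inserted, lock attained
print(try_attain(42))  # → False: statement ran fine, 0 rows affected
```

Treating "the command executed" as "the lock was attained" therefore lets every caller believe it holds the lock.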
@jeremydmiller There are open questions:

- Is it a problem that `SqliteAdvisoryLock` ignores the passed connection?
- See the `should_not_attain_lock_when_previous_owner_crashes_without_releasing` test. It shows that if the app crashes, the lock remains. Maybe use some file-based locks instead (or in addition), or a background cleaner based on the node id?
- `SqliteAdvisoryLock` uses a db connection that is kept open while it holds any lock, and the state of the connection is used for decision making (e.g. `TryAttainLockAsync` returns `false` if the connection is closed). I'm not sure how long a lock is supposed to stay in use. IMO it would be better to open a new connection each time, use it, and dispose of it, returning it to the connection pool. Does that make sense?
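The crash scenario in the second question can be reproduced against the SQLite engine directly. A Python `sqlite3` sketch (illustrative only; closing a connection stands in for a crash): a lock row is plain data and outlives its owner, while a transaction-level lock is torn down with the connection:

```python
import os
import sqlite3
import tempfile

db = os.path.join(tempfile.mkdtemp(), "locks.db")

# Row-based lock: the row is ordinary data, so it survives its owner.
owner = sqlite3.connect(db, isolation_level=None)
owner.execute("CREATE TABLE wolverine_locks (id INTEGER PRIMARY KEY)")
owner.execute("INSERT INTO wolverine_locks (id) VALUES (1)")
owner.close()  # stands in for a hard crash

peer = sqlite3.connect(db, isolation_level=None, timeout=0.1)
stale = peer.execute("SELECT COUNT(*) FROM wolverine_locks").fetchone()[0]
print(stale)  # → 1: nobody will ever release this row

# Transaction-level lock: released with the connection, no residue.
owner2 = sqlite3.connect(db, isolation_level=None)
owner2.execute("BEGIN EXCLUSIVE")
owner2.close()  # "crash": SQLite rolls back and releases the lock
peer.execute("INSERT INTO wolverine_locks (id) VALUES (2)")  # not blocked
```

This is the trade-off behind the two cleanup ideas above: a connection-bound lock needs no cleaner, while a row-based lock needs something (file lock, node-id cleaner, or TTL) to reap rows left by dead holders.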