
Fix sqlite tests #2636

Closed
dmytro-pryvedeniuk wants to merge 2 commits into JasperFx:main from dmytro-pryvedeniuk:fix-sqlite-tests
Conversation

@dmytro-pryvedeniuk
Contributor

@dmytro-pryvedeniuk dmytro-pryvedeniuk commented Apr 30, 2026

This PR fixes sqlite tests.

In scheduled_messages_are_processed_in_tenant_files we schedule messages with a 2s delay, wait 300ms, and check that the messages have not been received yet. It turns out that by the time the sending code gets to the messages, the scheduled time is already in the past, so they are treated as "to be sent immediately" rather than "scheduled for a later time".

The simple fix is to drop this check or to increase the schedule delay. For this test, the main point is that the messages are received by the correct tenants.

Separately, the sqlite advisory lock is created only after several delayed retries to attain the migration lock. These attempts fail because the wolverine_locks table does not exist before migration. In the end the migration runs without the lock, and the table is created only for the default tenant. As a fix, I moved table creation into SqliteAdvisoryLock: it now runs before each attempt to attain the lock (for each tenant).
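
A minimal sketch of that chicken-and-egg problem, using Python's stdlib sqlite3 rather than the actual Wolverine code. The two-column schema, the `try_attain` helper, and the `ensure_table` flag are assumptions made for illustration; only the table name and the `INSERT OR IGNORE` statement come from this discussion.

```python
import sqlite3

# Hypothetical minimal schema; the real wolverine_locks table may differ.
CREATE_LOCKS = (
    "CREATE TABLE IF NOT EXISTS wolverine_locks ("
    "lock_id INTEGER PRIMARY KEY, acquired_at REAL NOT NULL)"
)


def try_attain(conn, lock_id, now, ensure_table=False):
    """Return True only if this call inserted the lock row."""
    if ensure_table:
        # The fix described above: create the table before trying to lock.
        conn.execute(CREATE_LOCKS)
    try:
        cur = conn.execute(
            "INSERT OR IGNORE INTO wolverine_locks (lock_id, acquired_at) "
            "VALUES (?, ?)",
            (lock_id, now),
        )
    except sqlite3.OperationalError:
        # Before migration the table is missing, so the statement cannot
        # even be prepared and every lock attempt fails.
        return False
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
before_fix = try_attain(conn, 1, now=0.0)                    # table missing
after_fix = try_attain(conn, 1, now=0.0, ensure_table=True)  # table created first
```

Without the table, every retry fails the same way, which matches the repeated failed attempts described above.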

Other fixed issues:

  • SqliteMessageStore.TryAttainLockAsync wrongly assumes that the lock is attained whenever the SQL command executes, but the SQL is 'INSERT OR IGNORE', so it must check the number of affected rows. Fixed by delegating locking to the SqliteAdvisoryLock already in use. The downside is that the passed connection is ignored, as SqliteAdvisoryLock has its own connection. Can that be a problem?
  • SqliteMessageStore does not release the lock acquired in TryAttainLockAsync, so the migration lock remains in the table. Fixed by implementing ReleaseLockAsync. Again, the passed connection is ignored, as is the cancellation token.
  • SqliteAdvisoryLock.TryAttainLockAsync is not idempotent. It returns false the second time even though HasLock is true. Fixed by calling HasLock from TryAttainLockAsync.
  • SqliteAdvisoryLock.DisposeAsync throws NullReferenceException because the ReleaseLockAsync call above it nulls the connection once no locks remain. Fixed by removing the failing code (the finally section already handles it).
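
To illustrate the affected-rows check, the idempotent re-attain, and the release, here is a small Python sketch using the stdlib sqlite3 module. `RowLock` is a toy stand-in for SqliteAdvisoryLock, not the real class; the one-column schema and the sharing of a single in-memory connection between two lock objects are simplifications assumed for the example.

```python
import sqlite3


class RowLock:
    """Toy stand-in for SqliteAdvisoryLock; names and schema are illustrative."""

    def __init__(self, conn):
        self.conn = conn
        self.held = set()  # lock ids this instance believes it owns

    def has_lock(self, lock_id):
        return lock_id in self.held

    def try_attain(self, lock_id):
        # Idempotency: a second call by the current holder succeeds.
        if self.has_lock(lock_id):
            return True
        cur = self.conn.execute(
            "INSERT OR IGNORE INTO wolverine_locks (lock_id) VALUES (?)",
            (lock_id,),
        )
        # The statement succeeds either way; only rowcount tells us
        # whether a row was actually inserted.
        if cur.rowcount == 1:
            self.held.add(lock_id)
            return True
        return False

    def release(self, lock_id):
        # Delete the row so the lock does not linger in the table.
        self.conn.execute(
            "DELETE FROM wolverine_locks WHERE lock_id = ?", (lock_id,)
        )
        self.held.discard(lock_id)


conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE wolverine_locks (lock_id INTEGER PRIMARY KEY)")
a, b = RowLock(conn), RowLock(conn)

results = [a.try_attain(7), a.try_attain(7), b.try_attain(7)]
a.release(7)
results.append(b.try_attain(7))
```

Here the second object is refused while the first holds the row and succeeds once the row has been released.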

All this unblocks the sqlite tests and makes them faster (~2.7 min before vs. ~1 min after).

@jeremydmiller There are open questions:

  1. What do you think about these changes in general?
  2. Is it ok to delegate locking to SqliteAdvisoryLock ignoring the passed connection?
  3. Any idea how to handle the stale locks? See should_not_attain_lock_when_previous_owner_crashes_without_releasing test. It shows that if the app crashes the lock remains. Maybe some file-based locks instead (or in addition) or some background cleaner based on node id.
  4. SqliteAdvisoryLock uses a db connection that is kept open while it holds any lock, and the state of that connection is used for decision making (e.g. TryAttainLockAsync returns false if the connection is closed). I'm not sure how long a lock is supposed to be in use. IMO it's better to open a new connection each time, use it, and dispose it so that it returns to the connection pool. Does that make sense?

@jeremydmiller
Member

@dmytro-pryvedeniuk I'm good with these changes for now. In retrospect, it really doesn't make any sense to even have the locking on sqlite as you can't run it in a cluster anyway, right?

@mysticmind, what do you think?

I'm going to hold off on this for the next release though just to get bug fixes out

@dmytro-pryvedeniuk
Contributor Author

@jeremydmiller I read https://wolverinefx.net/guide/durability/sqlite.html#sqlite-messaging-transport as "multiple processes can use the same DB taking into account SQLite single-writer limitation". So the locking is needed for migration at least (or not, if anyway the lock is ignored after retries). Not sure about other use cases (current or future) though.

Does not this (https://github.com/JasperFx/wolverine/blob/e3caa7d614f3dd0ff01cbbf3c85bb831ea2d3bdd/src/Persistence/SqliteTests/message_store_initialization_and_configuration.cs) mean that the multiple nodes are possible?

@jeremydmiller
Member

@dmytro-pryvedeniuk I'm going to punt a bit and ask @mysticmind to review this as he has vastly more Sqlite experience than I do.

Babu, thank you in advance!

@mysticmind
Member

I haven't had a chance to look at this yet. I will do so in the coming week and get back to you.

@mysticmind
Member

mysticmind commented May 3, 2026

Does not this (https://github.com/JasperFx/wolverine/blob/e3caa7d614f3dd0ff01cbbf3c85bb831ea2d3bdd/src/Persistence/SqliteTests/message_store_initialization_and_configuration.cs) mean that the multiple nodes are possible?

The docs clearly state that it is for single-node usage. It doesn't scale to, or work for, multi-node usage. Using a single process is the right usage for SQLite.

@mysticmind
Member

mysticmind commented May 3, 2026

Hi @dmytro-pryvedeniuk, I really appreciated this PR. Your analysis pinned down several real bugs and the test that motivated it. I've taken your work as the starting point and pushed an alternative approach in a separate PR (link to follow): fix(sqlite): use BEGIN EXCLUSIVE for migration lock. The key difference is on bug no.2 (the wolverine_locks chicken-and-egg). Rather than creating the table from inside SqliteAdvisoryLock.TryAttainLockAsync, which adds a CREATE TABLE IF NOT EXISTS to a hot polling path that fires every 200 ms per tenant, I split the migration lock from the polling lock entirely:

  • Migration lock uses BEGIN EXCLUSIVE TRANSACTION. No schema dependency, so the chicken-and-egg disappears. As a bonus, the OS releases the file-level lock automatically on process death, which sidesteps your open question Q3 about stale locks.
  • Polling lock continues to use the wolverine_locks row scheme. By the time polling runs, migration has already created the table.
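
The behaviour of BEGIN EXCLUSIVE that this split relies on can be tried out with a short Python sketch against a throwaway database file. It uses the stdlib sqlite3 module, not the Wolverine code; the helper names and the file name `tenant.db` are made up for the example, and closing the first connection only approximates a holder going away.

```python
import os
import sqlite3
import tempfile


def open_conn(path):
    # timeout=0: fail immediately instead of waiting on a busy database.
    # isolation_level=None: autocommit mode, so BEGIN is issued explicitly.
    return sqlite3.connect(path, timeout=0, isolation_level=None)


def try_begin_exclusive(conn):
    """Return True if this connection obtained the exclusive file lock."""
    try:
        conn.execute("BEGIN EXCLUSIVE")
        return True
    except sqlite3.OperationalError:  # "database is locked"
        return False


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tenant.db")
    first, second = open_conn(path), open_conn(path)

    # Works on an empty database: no lock table has to exist yet.
    first_got = try_begin_exclusive(first)
    second_blocked = not try_begin_exclusive(second)

    # Closing the holder's connection ends its transaction and drops the
    # file lock, so nothing stale is left behind for the next attempt.
    first.close()
    second_after_close = try_begin_exclusive(second)

    second.execute("ROLLBACK")
    second.close()
```

Because the lock lives on the file rather than in a table row, there is nothing to clean up after the holder disappears.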

Implementation-wise this required making acquireMigrationLockAsync protected virtual on MessageDatabase<T> and adding a parallel releaseMigrationLockAsync virtual hook so SQLite can substitute its own primitive without touching the Postgres / SQL Server / RavenDb paths.

What I kept directly from your PR (with credit in the commit body):

  • SqliteAdvisoryLock.TryAttainLockAsync idempotency via HasLock short-circuit (your bug no.5).
  • SqliteAdvisoryLock.DisposeAsync NRE fix — dropped the duplicate close-and-dispose lines (your bug no.6).
  • SqliteMessageStore.TryAttainLockAsync delegating to SqliteAdvisoryLock so the affected-rows check is honored (your bug no.3).
  • New SqliteMessageStore.ReleaseLockAsync override so the lock row actually gets deleted (your bug no.4).
  • Dropping the racy Task.Delay(300ms) + ShouldBeFalse block in scheduled_messages_are_processed_in_tenant_files (your bug no.1).

Tests:

  • 4 focused new tests in sqlite_migration_lock.cs covering: migration leaves no row in wolverine_locks (proves BEGIN EXCLUSIVE is in effect), two hosts start concurrently against the same file in <1 s (no retry storm), TryAttainLockAsync idempotency, ReleaseLockAsync deletes the row.
  • Full SqliteTests suite: 378/378 pass on net8.0/net9.0/net10.0, ~1m 14s wall-clock.

Open question Q4 (long-lived vs per-call connection in SqliteAdvisoryLock): I left it as long-lived — keeps the HasLock ping cheap and the lock-row ownership unambiguous.
Open question Q2 (ignoring the caller's DbConnection): same conclusion as you, with the structural justification that SQLite's lock primitive doesn't need a shared connection.

@jeremydmiller In retrospect, I also think that the SQLite provider may not be used much in production, and the messaging layer does not come into play in really small apps. We may have to deprecate SQLite support altogether to reduce the time and effort spent on maintaining it.

@dmytro-pryvedeniuk
Contributor Author

@mysticmind Your solution is better; efcore also uses an exclusive lock for migration. What about non-migration locks? They can still be left stale, right? Regarding the single-node restriction, does it mean the same as "single process"?

@mysticmind
Member

Good catches, both right.

Stale non-migration locks: yes, the row-based wolverine_locks scheme can leave stale rows on a hard crash. The row isn't tied to the writing connection (unlike the new BEGIN EXCLUSIVE migration lock, which SQLite tears down with the connection). I just pushed a fix on fix/sqlite-migration-lock (ff0f3f8) that adds:

  • A TTL sweep in TryAttainLockAsync: DELETE FROM wolverine_locks WHERE lock_id = @id AND acquired_at < @cutoff before the INSERT OR IGNORE.
  • An implicit heartbeat: live holders re-attempt every poll tick (HealthCheckPollingTime for the leadership lock, ScheduledJobPollingTime for recovery/external-table locks), and the idempotent path now UPDATEs acquired_at so live holders are never reaped.

Default TTL 2 min, comfortably above the 10 s heartbeat cadence. Dead holders unblock peers within one TTL window of the next attempt.
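
A rough Python sketch of the sweep-then-insert sequence with the implicit heartbeat, using the stdlib sqlite3 module. The DELETE and INSERT OR IGNORE statements and the 2 min TTL follow the description above; the `TtlRowLock` class, the REAL timestamp column, the explicit `now` argument, and the single shared in-memory connection are assumptions made to keep the example deterministic.

```python
import sqlite3

TTL_SECONDS = 120.0  # the 2 min default TTL mentioned above


class TtlRowLock:
    """Illustrative only; not the actual SqliteAdvisoryLock."""

    def __init__(self, conn):
        self.conn = conn
        self.held = set()

    def try_attain(self, lock_id, now):
        if lock_id in self.held:
            # Implicit heartbeat: a live holder re-attains on every poll
            # tick and refreshes acquired_at, so it is never reaped.
            self.conn.execute(
                "UPDATE wolverine_locks SET acquired_at = ? WHERE lock_id = ?",
                (now, lock_id),
            )
            return True
        # TTL sweep: remove a row whose holder stopped refreshing it.
        self.conn.execute(
            "DELETE FROM wolverine_locks WHERE lock_id = ? AND acquired_at < ?",
            (lock_id, now - TTL_SECONDS),
        )
        cur = self.conn.execute(
            "INSERT OR IGNORE INTO wolverine_locks (lock_id, acquired_at) "
            "VALUES (?, ?)",
            (lock_id, now),
        )
        if cur.rowcount == 1:
            self.held.add(lock_id)
            return True
        return False


conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute(
    "CREATE TABLE wolverine_locks ("
    "lock_id INTEGER PRIMARY KEY, acquired_at REAL NOT NULL)"
)
holder, peer = TtlRowLock(conn), TtlRowLock(conn)

timeline = [
    holder.try_attain(1, now=0.0),    # holder takes the lock
    peer.try_attain(1, now=10.0),     # row is fresh, peer is refused
    holder.try_attain(1, now=100.0),  # heartbeat moves acquired_at to 100
    peer.try_attain(1, now=150.0),    # cutoff is 30, row at 100 survives
    peer.try_attain(1, now=221.0),    # cutoff is 101, stale row is reaped
]
```

The last call models the holder having died after its heartbeat at 100. The sketch does not cover a holder that is still alive after its row has been reaped, which a real implementation would need to consider.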

"Single node" vs "single process": for the SQLite provider they're effectively the same. SQLite the engine permits multiple processes on one .db, but Wolverine on top assumes one host per file:

  • The polling-lock scheme can't recover from a peer crash without the TTL sweep above (and even with it, two processes contending on a single SQLite file will hit SQLITE_BUSY retry storms under load).
  • The can_send_from_one_node_to_another_by_destination compliance test was skipped on the SQLite local fixture in 9750cb4 precisely because the SQLite setup models a single node.
  • Multi-tenancy splits per-tenant files across one host; it doesn't enable a second host.

So: one Wolverine process per file. For multi-process or true multi-node, the docs steer to Postgres / SQL Server.

@dmytro-pryvedeniuk
Contributor Author

Nice, "one file, one process" is simpler and safer, as writes are needed. Do you think we can make it clearer in the documentation? I see NodeReassignmentPollingTime in a sample, the dbcontrol queue, and the sqlite transport listed as a feature. All this can lead someone to think that at least multiple processes per file are supported.

@mysticmind
Member

Sure, will recheck the samples and sort it out.

jeremydmiller pushed a commit that referenced this pull request May 4, 2026
…tbeat for non-migration advisory locks (#2666)

* fix(sqlite): use BEGIN EXCLUSIVE for migration lock

The row-based wolverine_locks scheme couldn't serialize migration: the
table is created by the migration it's supposed to protect, so the first
migration per tenant burned ~5.5s of failed lock retries before running
unprotected. Migration now uses BEGIN EXCLUSIVE; polling keeps the row
lock (by then the table exists).

From #2636 (dmytro-pryvedeniuk):
- SqliteAdvisoryLock.TryAttainLockAsync now idempotent via HasLock
- SqliteAdvisoryLock.DisposeAsync NRE: dropped duplicate close+dispose
- SqliteMessageStore polling overrides delegate to SqliteAdvisoryLock
  (TryAttain checks affected rows; Release actually deletes the row)
- Dropped racy "not yet delivered" assertion in scheduled-tenant test

Not picked: CreateLocksTableIfMissing in the lock hot path -- unneeded
once migration uses BEGIN EXCLUSIVE.

Also skips can_send_from_one_node_to_another_by_destination on the
SQLite local fixture (single-host, no second node) and adds 4 focused
migration-lock tests.

* fix(sqlite): TTL sweep + heartbeat for non-migration advisory locks

wolverine_locks rows aren't bound to the writing connection, so a
hard-killed holder leaves a row no peer reaps. Sweep stale rows on
attempt; refresh acquired_at when the live holder re-attains. Live
holders re-attempt on every poll tick (HealthCheck/ScheduledJob), so
the heartbeat is implicit. Default TTL 2m.

Split sqlite_migration_lock.cs: migration-lock tests stay; advisory-
lock tests (idempotency, release, TTL/heartbeat) move to a new
sqlite_advisory_lock.cs.
3 participants