Retry transient catalog-concurrency errors in PostgreSQL migration DDL (closes #293)#294
Merged
Merged
Conversation
closes #293) Follow-up to #282. That fix wrapped only CREATE SCHEMA in a plpgsql block catching 42P06/23505. The rest of the migration DDL emitted by migration.WriteAllUpdates(...) — CREATE OR REPLACE FUNCTION mt_immutable_*, CREATE TABLE IF NOT EXISTS, etc. — stayed unguarded, so concurrent lazy EnsureStorageExists against the same database could fail with "XX000: tuple concurrently updated" (two backends updating the same pg_proc / catalog row at once). Seen as a Marten conjoined multi-tenant CI flake (JasperFx/marten#4552). The DDL Weasel emits is idempotent (IF NOT EXISTS / CREATE OR REPLACE) and executeDelta runs each statement on the bare connection (autocommit, no ambient transaction), so a statement that lost a catalog race is safe to re-run from a clean slate. Wrap each cmd.ExecuteNonQueryAsync in a bounded retry (3 attempts, jittered sub-100ms backoff) keyed on the transient SQLSTATEs: - 40001 serialization_failure - 40P01 deadlock_detected - XX000 internal_error ONLY when the message is "tuple concurrently updated" (XX000 is a catch-all, so the message guard avoids blanket-retrying unrelated internal errors) The retry sits inside the existing try/catch, so after exhausting attempts the established logger.OnFailure / rethrow path is unchanged. Kept PG-specific (these SQLSTATEs are PostgreSQL's; executeDelta is a per-provider override). The classification is extracted as a pure internal predicate IsTransientCatalogConcurrency(sqlState, messageText) and covered by 12 unit assertions (the race itself isn't deterministically testable). Happy-path migration/schema integration tests pass unchanged (21/22, 1 pre-existing skip). Downstream Marten#4552 stabilizes once it consumes a Weasel build with this. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jeremydmiller
added a commit
that referenced
this pull request
May 24, 2026
First patch release on the 9.0 line. Ships: - #294 (closes #293) — retry transient catalog-concurrency errors (XX000 "tuple concurrently updated", 40001, 40P01) in PostgreSQL migration DDL - #295 (closes #290, #291) — EF Core schema mapping: skip Npgsql IsRowVersion()->xmin system column; emit ComplexProperty/ComplexCollection .ToJson() container columns (EF Core 10) - #296 — JasperFx 2.0.0 -> 2.0.1 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jeremydmiller
added a commit
that referenced
this pull request
May 28, 2026
…igrator) The bounded retry in PostgresqlMigrator.executeWithConcurrencyRetryAsync (#293 / #294) re-invokes cmd.ExecuteNonQueryAsync on the same DbCommand after a transient PostgresException, but doesn't account for Npgsql moving the underlying connection to Closed or Broken when the server-side error has aborted the session (40P01 deadlock_detected and XX000 "tuple concurrently updated" in particular). The retry then throws InvalidOperationException("Connection is not open") — which isn't a PostgresException, so it slips past the catch filter and surfaces to the caller as a hard migration failure, defeating the whole point of the retry. Repro: recurring intermittent failure on EventSourcingTests.end_to_end_event_capture_and_fetching_the_stream. query_before_saving(tenancyStyle: Conjoined) in JasperFx/marten PRs #4576, #4578, #4582, #4584 — all hit this exact path. Fix: after the backoff delay, call EnsureConnectionOpenAsync — a small internal helper that puts the connection back into Open state before the next ExecuteNonQueryAsync attempt. Broken connections require a Close before OpenAsync (OpenAsync on Broken throws). Tests: 4 new unit tests in PostgresqlMigratorConcurrencyRetryTests exercise the reopen rules deterministically against a fake DbCommand / DbConnection — Open is a no-op, Closed reopens, Broken closes-then-reopens, null Connection short-circuits. The retry loop itself races on the catalog and can't be exercised deterministically, but the reopen helper is a pure function over (DbCommand, ConnectionState) and is fully testable. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jeremydmiller
added a commit
that referenced
this pull request
May 28, 2026
…igrator) (#299) The bounded retry in PostgresqlMigrator.executeWithConcurrencyRetryAsync (#293 / #294) re-invokes cmd.ExecuteNonQueryAsync on the same DbCommand after a transient PostgresException, but doesn't account for Npgsql moving the underlying connection to Closed or Broken when the server-side error has aborted the session (40P01 deadlock_detected and XX000 "tuple concurrently updated" in particular). The retry then throws InvalidOperationException("Connection is not open") — which isn't a PostgresException, so it slips past the catch filter and surfaces to the caller as a hard migration failure, defeating the whole point of the retry. Repro: recurring intermittent failure on EventSourcingTests.end_to_end_event_capture_and_fetching_the_stream. query_before_saving(tenancyStyle: Conjoined) in JasperFx/marten PRs #4576, #4578, #4582, #4584 — all hit this exact path. Fix: after the backoff delay, call EnsureConnectionOpenAsync — a small internal helper that puts the connection back into Open state before the next ExecuteNonQueryAsync attempt. Broken connections require a Close before OpenAsync (OpenAsync on Broken throws). Tests: 4 new unit tests in PostgresqlMigratorConcurrencyRetryTests exercise the reopen rules deterministically against a fake DbCommand / DbConnection — Open is a no-op, Closed reopens, Broken closes-then-reopens, null Connection short-circuits. The retry loop itself races on the catalog and can't be exercised deterministically, but the reopen helper is a pure function over (DbCommand, ConnectionState) and is fully testable. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #293. Follow-up to #282.
Problem
#282 (
6b6d296) made onlyCREATE SCHEMAconcurrent-safe — a plpgsql sub-block catching42P06/23505. The rest of the migration DDL emitted bymigration.WriteAllUpdates(...)(CREATE OR REPLACE FUNCTION mt_immutable_*,CREATE TABLE IF NOT EXISTS, …) stayed unguarded. When two sessions lazilyEnsureStorageExistsagainst the same database, the shared catalog DDL races and one backend gets:That SQLSTATE isn't caught by #282, so it propagates. Surfaced as a Marten conjoined multi-tenant CI flake (
query_before_saving, downstream JasperFx/marten#4552).Why retry is the right fix
Two properties of
PostgresqlMigrator.executeDeltamake a bounded retry both safe and minimal:IF NOT EXISTS/CREATE OR REPLACEthroughout.executeDeltanever setscmd.Transaction, so a failed statement rolls back fully and a retry re-runs from a clean slate. (CREATE INDEX CONCURRENTLYis already split into its own chunk, so the main body is a transaction-safe block.)A wider DO-block guard doesn't generalize — the post-schema DDL isn't in one catchable block, and
CONCURRENTLYcan't be. Retry is the issue's recommended option and the general fix.Change
Wrap each
cmd.ExecuteNonQueryAsyncin a bounded retry (3 attempts, jittered sub-100ms backoff), inside the existing try/catch so the establishedlogger.OnFailure/ rethrow path is untouched after exhaustion. Retry only on the transient, retry-safe SQLSTATEs:4000140P01XX000tuple concurrently updated(XX000 is a catch-all)Kept PG-specific (these codes are PostgreSQL's;
executeDeltais a per-provider override). The classification is a pure internal predicateIsTransientCatalogConcurrency(sqlState, messageText).Test plan
PostgresqlMigratorConcurrencyRetryTests— 12 assertions on the predicate: 40001/40P01 by code; XX000 matches on message (case-insensitive, substring), rejects unrelated XX000 messages + null; unrelated SQLSTATEs (incl. the Make CREATE SCHEMA migration DDL concurrent-safe against pg_namespace race #282-handled 42P06/23505) not retried. The catalog race itself isn't deterministically testable, so the classification is what's covered.DatabaseWithTables/ Bug983 suites 21/22 (1 pre-existing skip) against PostgreSQL.Downstream JasperFx/marten#4552 stabilizes once Marten consumes a Weasel build with this (an rc.2 candidate).
🤖 Generated with Claude Code