
Retry transient catalog-concurrency errors in PostgreSQL migration DDL (closes #293) #294

Merged
jeremydmiller merged 1 commit into master from fix/293-migration-ddl-concurrency-retry on May 24, 2026
Conversation

@jeremydmiller
Member

Closes #293. Follow-up to #282.

Problem

#282 (6b6d296) made only CREATE SCHEMA concurrent-safe, via a plpgsql sub-block catching 42P06/23505. The rest of the migration DDL emitted by migration.WriteAllUpdates(...) (CREATE OR REPLACE FUNCTION mt_immutable_*, CREATE TABLE IF NOT EXISTS, …) stayed unguarded. When two sessions lazily call EnsureStorageExists against the same database, their DDL races on the shared catalog and one backend gets:

Npgsql.PostgresException : XX000: tuple concurrently updated

That SQLSTATE isn't caught by the #282 guard, so it propagates. It surfaced as a Marten conjoined multi-tenant CI flake (query_before_saving, downstream JasperFx/marten#4552).

Why retry is the right fix

Two properties of PostgresqlMigrator.executeDelta make a bounded retry both safe and minimal:

  1. The DDL is idempotent: IF NOT EXISTS / CREATE OR REPLACE throughout.
  2. Each statement auto-commits independently: executeDelta never sets cmd.Transaction, so a failed statement rolls back fully and a retry re-runs from a clean slate. (CREATE INDEX CONCURRENTLY is already split into its own chunk, so the main body is a transaction-safe block.)

A wider DO-block guard doesn't generalize — the post-schema DDL isn't in one catchable block, and CONCURRENTLY can't be. Retry is the issue's recommended option and the general fix.

Change

Wrap each cmd.ExecuteNonQueryAsync in a bounded retry (3 attempts, jittered sub-100ms backoff), inside the existing try/catch so the established logger.OnFailure / rethrow path is untouched after exhaustion. Retry only on the transient, retry-safe SQLSTATEs:

| SQLSTATE | Meaning | Match |
|---|---|---|
| 40001 | serialization_failure | by code |
| 40P01 | deadlock_detected | by code |
| XX000 | internal_error | only when message is tuple concurrently updated (XX000 is a catch-all) |
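
As a rough illustration of the bounded retry described above, here is a minimal sketch. It is not the code merged in this PR: `RetrySketch`, `ExecuteWithRetryAsync`, and the `isTransient` callback are hypothetical names, the `execute` delegate stands in for `cmd.ExecuteNonQueryAsync`, and the 10 to 99 ms jitter range is only one way to satisfy "sub-100ms".

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class RetrySketch
{
    // Three attempts in total, as stated in the PR description.
    public const int MaxAttempts = 3;

    // `execute` stands in for cmd.ExecuteNonQueryAsync; `isTransient` stands in
    // for the SQLSTATE/message classification. Both shapes are assumptions.
    public static async Task<int> ExecuteWithRetryAsync(
        Func<CancellationToken, Task<int>> execute,
        Func<Exception, bool> isTransient,
        CancellationToken ct = default)
    {
        for (var attempt = 1; ; attempt++)
        {
            try
            {
                return await execute(ct).ConfigureAwait(false);
            }
            // The filter is false on the last attempt or for a non-transient
            // error, so the exception propagates to the caller's own try/catch.
            catch (Exception e) when (attempt < MaxAttempts && isTransient(e))
            {
                // Jittered backoff: a random delay of 10 to 99 ms.
                var delayMs = Random.Shared.Next(10, 100);
                await Task.Delay(delayMs, ct).ConfigureAwait(false);
            }
        }
    }
}
```

Because the exception filter declines to handle the final failure, the original exception reaches the caller unchanged, which matches the intent of leaving the logger.OnFailure / rethrow path untouched after exhaustion.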

Kept PG-specific (these codes are PostgreSQL's; executeDelta is a per-provider override). The classification is a pure internal predicate IsTransientCatalogConcurrency(sqlState, messageText).
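
For readers who want to see the shape of such a predicate, the following is a sketch reconstructed from the table and test plan in this description, not a copy of the merged code. The containing class name `TransientErrorSketch` is hypothetical; the case-insensitive substring match follows the behaviour the test plan describes.

```csharp
using System;

public static class TransientErrorSketch
{
    // Illustrative reconstruction: returns true only for the retry-safe cases
    // listed in the PR description.
    public static bool IsTransientCatalogConcurrency(string? sqlState, string? messageText)
    {
        switch (sqlState)
        {
            case "40001": // serialization_failure
            case "40P01": // deadlock_detected
                return true;

            case "XX000": // internal_error is a catch-all, so require the specific message
                return messageText is not null
                       && messageText.Contains("tuple concurrently updated",
                           StringComparison.OrdinalIgnoreCase);

            default:
                return false;
        }
    }
}
```

Keeping the classification free of any connection or command state is what makes it unit-testable without a live PostgreSQL race.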

Test plan

  • PostgresqlMigratorConcurrencyRetryTests: 12 assertions on the predicate. 40001/40P01 match by code; XX000 matches on message (case-insensitive, substring) and rejects unrelated XX000 messages and null; unrelated SQLSTATEs (incl. the #282-handled 42P06/23505) are not retried. The catalog race itself isn't deterministically testable, so the classification is what's covered.
  • Happy-path integration unchanged: migration-scenario / schema-creation / DatabaseWithTables / Bug983 suites pass 21/22 (1 pre-existing skip) against PostgreSQL.

Downstream JasperFx/marten#4552 stabilizes once Marten consumes a Weasel build with this (an rc.2 candidate).

🤖 Generated with Claude Code

closes #293)

Follow-up to #282. That fix wrapped only CREATE SCHEMA in a plpgsql block
catching 42P06/23505. The rest of the migration DDL emitted by
migration.WriteAllUpdates(...) — CREATE OR REPLACE FUNCTION mt_immutable_*,
CREATE TABLE IF NOT EXISTS, etc. — stayed unguarded, so concurrent lazy
EnsureStorageExists against the same database could fail with
"XX000: tuple concurrently updated" (two backends updating the same
pg_proc / catalog row at once). Seen as a Marten conjoined multi-tenant CI
flake (JasperFx/marten#4552).

The DDL Weasel emits is idempotent (IF NOT EXISTS / CREATE OR REPLACE) and
executeDelta runs each statement on the bare connection (autocommit, no
ambient transaction), so a statement that lost a catalog race is safe to
re-run from a clean slate. Wrap each cmd.ExecuteNonQueryAsync in a bounded
retry (3 attempts, jittered sub-100ms backoff) keyed on the transient
SQLSTATEs:

  - 40001 serialization_failure
  - 40P01 deadlock_detected
  - XX000 internal_error ONLY when the message is "tuple concurrently
    updated" (XX000 is a catch-all, so the message guard avoids
    blanket-retrying unrelated internal errors)

The retry sits inside the existing try/catch, so after exhausting attempts
the established logger.OnFailure / rethrow path is unchanged. Kept
PG-specific (these SQLSTATEs are PostgreSQL's; executeDelta is a per-provider
override).

The classification is extracted as a pure internal predicate
IsTransientCatalogConcurrency(sqlState, messageText) and covered by 12 unit
assertions (the race itself isn't deterministically testable). Happy-path
migration/schema integration tests pass unchanged (21/22, 1 pre-existing
skip). Downstream Marten#4552 stabilizes once it consumes a Weasel build
with this.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jeremydmiller jeremydmiller merged commit 60a1265 into master May 24, 2026
21 checks passed
@jeremydmiller jeremydmiller deleted the fix/293-migration-ddl-concurrency-retry branch May 24, 2026 01:20
jeremydmiller added a commit that referenced this pull request May 24, 2026
First patch release on the 9.0 line. Ships:
- #294 (closes #293) — retry transient catalog-concurrency errors (XX000
  "tuple concurrently updated", 40001, 40P01) in PostgreSQL migration DDL
- #295 (closes #290, #291) — EF Core schema mapping: skip Npgsql
  IsRowVersion()->xmin system column; emit ComplexProperty/ComplexCollection
  .ToJson() container columns (EF Core 10)
- #296 — JasperFx 2.0.0 -> 2.0.1

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jeremydmiller added a commit that referenced this pull request May 28, 2026
…igrator)

The bounded retry in PostgresqlMigrator.executeWithConcurrencyRetryAsync
(#293 / #294) re-invokes cmd.ExecuteNonQueryAsync on the same DbCommand
after a transient PostgresException, but doesn't account for Npgsql
moving the underlying connection to Closed or Broken when the
server-side error has aborted the session (40P01 deadlock_detected and
XX000 "tuple concurrently updated" in particular). The retry then
throws InvalidOperationException("Connection is not open") — which
isn't a PostgresException, so it slips past the catch filter and
surfaces to the caller as a hard migration failure, defeating the
whole point of the retry.

Repro: recurring intermittent failure on
EventSourcingTests.end_to_end_event_capture_and_fetching_the_stream.
query_before_saving(tenancyStyle: Conjoined) in JasperFx/marten PRs
#4576, #4578, #4582, #4584 — all hit this exact path.

Fix: after the backoff delay, call EnsureConnectionOpenAsync — a small
internal helper that puts the connection back into Open state before
the next ExecuteNonQueryAsync attempt. Broken connections require a
Close before OpenAsync (OpenAsync on Broken throws).

Tests: 4 new unit tests in PostgresqlMigratorConcurrencyRetryTests
exercise the reopen rules deterministically against a fake DbCommand /
DbConnection — Open is a no-op, Closed reopens, Broken closes-then-reopens,
null Connection short-circuits. The retry loop itself races on the
catalog and can't be exercised deterministically, but the reopen
helper is a pure function over (DbCommand, ConnectionState) and is
fully testable.
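
The reopen rules listed above could look roughly like the sketch below. This is an illustration under assumptions, not the helper from the commit: the class name `ReopenSketch` and the `DbConnection?` overload are hypothetical (the overload exists only to make the rules easy to exercise), and connection states other than Open, Closed, and Broken are not treated specially.

```csharp
using System.Data;
using System.Data.Common;
using System.Threading;
using System.Threading.Tasks;

public static class ReopenSketch
{
    // Command-based entry point, mirroring the (DbCommand, ConnectionState)
    // framing in the commit message.
    public static Task EnsureConnectionOpenAsync(DbCommand cmd, CancellationToken ct = default)
        => EnsureConnectionOpenAsync(cmd.Connection, ct);

    // Hypothetical overload on the connection itself.
    public static async Task EnsureConnectionOpenAsync(DbConnection? conn, CancellationToken ct = default)
    {
        // No connection attached: nothing to do.
        if (conn is null) return;

        // Already open: no-op.
        if (conn.State == ConnectionState.Open) return;

        // A Broken connection must be closed before it can be opened again.
        if (conn.State == ConnectionState.Broken)
        {
            await conn.CloseAsync().ConfigureAwait(false);
        }

        await conn.OpenAsync(ct).ConfigureAwait(false);
    }
}
```

In a retry loop, a helper like this would be awaited after the backoff delay and before the next ExecuteNonQueryAsync attempt.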

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jeremydmiller added a commit that referenced this pull request May 28, 2026
…igrator) (#299)

Development

Successfully merging this pull request may close these issues.

Migration DDL still races under concurrent EnsureStorageExists: XX000 "tuple concurrently updated" (follow-up to #282)
