
fix(binlog): add GC safepoint for binlog producer to prevent chunk deletion during traversal#10604

Closed
scbrown wants to merge 2 commits into dolthub:main from scbrown:fix/binlog-producer-gc-safepoint

Conversation

@scbrown

@scbrown scbrown commented Mar 2, 2026

Fixes #10602

Summary

The binlog producer traverses Prolly trees via prolly.DiffMaps during WorkingRootUpdated, but this traversal is not protected against concurrent DOLT_GC. If GC deletes chunks mid-traversal, the server panics.

Changes

  • doltdb.go: Add GCPausableListener interface with Stop() chan struct{} and Resume() methods
  • binlog_producer.go: Implement GCPausableListener using sync.RWMutex. WorkingRootUpdated holds the read lock during diff traversal; Stop() acquires the write lock, blocking until in-flight traversals drain
  • dolt_gc.go: Before running GC, iterate DatabaseUpdateListeners and call Stop() on any that implement GCPausableListener. Resume all after GC completes.
  • binlog_producer_gc_test.go: Comprehensive test suite covering no-inflight drain, inflight wait, multiple concurrent inflight, stop/resume idempotency, and concurrent stop+WorkingRootUpdated contention

Design

Uses sync.RWMutex for minimal overhead on the hot path (WorkingRootUpdated only takes a read lock). The Stop() method returns a channel so the caller can wait asynchronously for all in-flight operations to drain before proceeding with GC.

Testing

  • All new tests pass, including under the race detector (go test -race)
  • Verified interface satisfaction at compile time

…letion during Prolly tree traversal

The binlog producer's WorkingRootUpdated method traverses Prolly trees via
prolly.DiffMaps to generate row events for connected replicas. If DOLT_GC
runs concurrently, it can delete chunks that the binlog producer is actively
traversing, causing "empty chunk returned from ChunkStore" panics.

This adds a GCPausableListener interface that the binlog producer implements.
Before GC runs, it calls Stop() which acquires a write lock, blocking until
all in-flight WorkingRootUpdated calls (which hold read locks) complete. After
GC finishes, Resume() releases the write lock, allowing new binlog events to
be generated.

Changes:
- Add GCPausableListener interface to doltdb package
- Implement Stop()/Resume() on binlogProducer using sync.RWMutex
- Add RLock/RUnlock around WorkingRootUpdated's diff traversal
- Pause GCPausableListeners in RunDoltGC before running GC
- Comprehensive test suite for Stop/Resume/drain semantics

Empty strings in TEXT/BLOB columns are stored with a zero hash address
(no out-of-band chunk). The binlog serializer unconditionally called
ReadBytes on this address, causing "empty chunk returned from ChunkStore"
panic. Skip the ChunkStore lookup when the address is the zero hash and
encode zero-length data instead.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@reltuk
Contributor

reltuk commented Mar 2, 2026

Hi @scbrown. Thanks for the report!

Can you provide an example server configuration where you saw the panic?

I do not believe the approach in this PR is the approach we would want to take. But looking at the code, it does seem possible that the retained roots tracking and GC safepoints on the binlogreplication diff path have been misrepresented in some way. Being able to test more deeply and dig into the root cause would be helpful.

@scbrown
Author

scbrown commented Mar 2, 2026

Hi @reltuk, thanks for the quick response!

Server configuration

config.yaml:

```yaml
listener:
  host: 0.0.0.0
  port: 3306
  max_connections: 500

remotesapi:
  port: 50051
```

.dolt/config.json (persisted system variables):

```json
{
  "sqlserver.global.log_bin": "1",
  "sqlserver.global.gtid_mode": "ON",
  "sqlserver.global.enforce_gtid_consistency": "ON",
  "sqlserver.global.binlog_row_metadata": "FULL",
  "sqlserver.global.dolt_transaction_commit": "1"
}
```

Running Dolt v1.83.0 on Linux (Ubuntu 22.04, go1.23.4).

What we actually hit

To be transparent: the panic we reproduced in production was not from a GC race. It was the zero-hash issue described in the companion PR #10603.

Our production table has ~11K rows with 7 TEXT columns, many containing empty strings (''). Empty strings store a zero hash as the blob address in the prolly tuple. encodeBytesFromAddress passes this zero hash to ns.ReadBytes(), which calls ChunkStore.Get() with hash zero → EmptyChunk → assertTrue panic.

We filed this GC PR separately because we noticed WorkingRootUpdated traverses prolly trees via DiffMaps and we weren't sure if this path participates in the existing GC safepoint infrastructure. It seemed worth flagging given the similarity to #9001/#9077 (stats path).

Custom binary testing

We built a custom binary from our fork with both fixes applied and tested on our production server:

  1. Phase 1 (defensive): Converted the assertTrue panic in nodeStore.Read to a returned error + wrapped errors in encodeBytesFromAddress with address/type context. Result: server stays up, logs the error gracefully, commits succeed.

  2. Phase 2 (root cause): Added addr.IsEmpty() check in encodeBytesFromAddress — if the address is all zeros, serialize as a zero-length value with the correct length-prefix size for the blob type. Result: no errors at all, clean binlog serialization for empty, NULL, and populated TEXT columns.

Test matrix on production data (11K rows, 7 TEXT columns):

  • INSERT with non-empty TEXT + DOLT_COMMIT → PASS
  • INSERT with empty string TEXT ('') + DOLT_COMMIT → PASS
  • INSERT with NULL TEXT + DOLT_COMMIT → PASS
  • Mixed columns (some empty, some populated, some NULL) + DOLT_COMMIT → PASS
  • Multiple consecutive commits with binlog ON → Server stable, no panics or errors

On GC safepoints

Your point about retained roots tracking is well taken. If the binlog replication diff path already participates in GC safepoints through a mechanism we missed, this PR may be unnecessary. The zero-hash fix in #10603 is the one that addresses our actual crash. Happy to close this one if you'd prefer to handle the GC coordination through your existing infrastructure.

Let me know if you'd like any additional detail or testing on either PR!

@timsehn
Contributor

timsehn commented Mar 3, 2026

Closing per discussion. We have a fix for the real bug you found.

@timsehn timsehn closed this Mar 3, 2026


Successfully merging this pull request may close these issues.

BUG: Binlog producer does not participate in GC safepoints — concurrent DOLT_GC can delete chunks mid-traversal
