fix(binlog): add GC safepoint for binlog producer to prevent chunk deletion during traversal #10604
scbrown wants to merge 2 commits into dolthub:main from
Conversation
fix(binlog): add GC safepoint for binlog producer to prevent chunk deletion during Prolly tree traversal

The binlog producer's `WorkingRootUpdated` method traverses Prolly trees via `prolly.DiffMaps` to generate row events for connected replicas. If `DOLT_GC` runs concurrently, it can delete chunks that the binlog producer is actively traversing, causing "empty chunk returned from ChunkStore" panics.

This adds a `GCPausableListener` interface that the binlog producer implements. Before GC runs, it calls `Stop()`, which acquires a write lock, blocking until all in-flight `WorkingRootUpdated` calls (which hold read locks) complete. After GC finishes, `Resume()` releases the write lock, allowing new binlog events to be generated.

Changes:
- Add `GCPausableListener` interface to the `doltdb` package
- Implement `Stop()`/`Resume()` on `binlogProducer` using `sync.RWMutex`
- Add `RLock`/`RUnlock` around `WorkingRootUpdated`'s diff traversal
- Pause `GCPausableListener`s in `RunDoltGC` before running GC
- Comprehensive test suite for Stop/Resume/drain semantics
Empty strings in TEXT/BLOB columns are stored with a zero hash address (no out-of-band chunk). The binlog serializer unconditionally called `ReadBytes` on this address, causing an "empty chunk returned from ChunkStore" panic. Skip the ChunkStore lookup when the address is the zero hash and encode zero-length data instead.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Hi @scbrown. Thanks for the report! Can you provide an example server configuration where you saw the panic? I do not believe the approach in this PR is the approach we would want to take. But looking at the code, it does seem possible that the retained roots tracking and GC safepoints on the binlogreplication diff path have been misrepresented in some way. Being able to test more deeply and dig into the root cause would be helpful.
Hi @reltuk, thanks for the quick response!

Server configuration

```yaml
# config.yaml
listener:
  host: 0.0.0.0
  port: 3306
  max_connections: 500
remotesapi:
  port: 50051
```

```json
// .dolt/config.json (persisted system variables)
{
  "sqlserver.global.log_bin": "1",
  "sqlserver.global.gtid_mode": "ON",
  "sqlserver.global.enforce_gtid_consistency": "ON",
  "sqlserver.global.binlog_row_metadata": "FULL",
  "sqlserver.global.dolt_transaction_commit": "1"
}
```

Running Dolt v1.83.0 on Linux (Ubuntu 22.04, go1.23.4).

What we actually hit

To be transparent: the panic we reproduced in production was not from a GC race. It was the zero-hash issue described in the companion PR #10603. Our production table has ~11K rows with 7 TEXT columns, many containing empty strings. We filed this GC PR separately because we noticed the `prolly.DiffMaps` traversal in `WorkingRootUpdated` is not protected against concurrent `DOLT_GC`.

Custom binary testing

We built a custom binary from our fork with both fixes applied and tested on our production server:
Test matrix on production data (11K rows, 7 TEXT columns):
On GC safepoints

Your point about retained roots tracking is well taken. If the binlog replication diff path already participates in GC safepoints through a mechanism we missed, this PR may be unnecessary. The zero-hash fix in #10603 is the one that addresses our actual crash. Happy to close this one if you'd prefer to handle the GC coordination through your existing infrastructure.

Let me know if you'd like any additional detail or testing on either PR!
Closing per discussion. We have a fix for the real bug you found.
Fixes #10602
Summary
The binlog producer traverses Prolly trees via `prolly.DiffMaps` during `WorkingRootUpdated`, but this traversal is not protected against concurrent `DOLT_GC`. If GC deletes chunks mid-traversal, the server panics.

Changes
- `doltdb.go`: Add `GCPausableListener` interface with `Stop() chan struct{}` and `Resume()` methods
- `binlog_producer.go`: Implement `GCPausableListener` using `sync.RWMutex`; `WorkingRootUpdated` holds `RLock` during diff traversal, `Stop()` acquires the write lock (blocks until in-flight traversals drain)
- `dolt_gc.go`: Before running GC, iterate `DatabaseUpdateListeners` and call `Stop()` on any that implement `GCPausableListener`; resume all after GC completes
- `binlog_producer_gc_test.go`: Comprehensive test suite covering no-inflight drain, inflight wait, multiple concurrent inflight, stop/resume idempotency, and concurrent stop + `WorkingRootUpdated` contention

Design
Uses `sync.RWMutex` for minimal overhead on the hot path (`WorkingRootUpdated` only takes a read lock). The `Stop()` method returns a channel so the caller can wait asynchronously for all in-flight operations to drain before proceeding with GC.

Testing
- Verified with the race detector (`go test -race`)