
fix(binlog): add GC safepoint for binlog producer to prevent chunk deletion during traversal#10604

Closed
scbrown wants to merge 2 commits into dolthub:main from scbrown:fix/binlog-producer-gc-safepoint

Conversation

@scbrown

@scbrown scbrown commented Mar 2, 2026

Fixes #10602

Summary

The binlog producer traverses Prolly trees via prolly.DiffMaps during WorkingRootUpdated, but this traversal is not protected against concurrent DOLT_GC. If GC deletes chunks mid-traversal, the server panics.

Changes

  • doltdb.go: Add GCPausableListener interface with Stop() chan struct{} and Resume() methods
  • binlog_producer.go: Implement GCPausableListener using sync.RWMutex. WorkingRootUpdated holds the read lock during diff traversal; Stop() acquires the write lock, blocking until in-flight traversals drain
  • dolt_gc.go: Before running GC, iterate DatabaseUpdateListeners and call Stop() on any that implement GCPausableListener. Resume all after GC completes.
  • binlog_producer_gc_test.go: Comprehensive test suite covering no-inflight drain, inflight wait, multiple concurrent inflight, stop/resume idempotency, and concurrent stop+WorkingRootUpdated contention

Design

Uses sync.RWMutex for minimal overhead on the hot path (WorkingRootUpdated only takes a read lock). The Stop() method returns a channel so the caller can wait asynchronously for all in-flight operations to drain before proceeding with GC.

Testing

  • All new tests pass, including under the race detector (go test -race)
  • Verified interface satisfaction at compile time

…letion during Prolly tree traversal

The binlog producer's WorkingRootUpdated method traverses Prolly trees via
prolly.DiffMaps to generate row events for connected replicas. If DOLT_GC
runs concurrently, it can delete chunks that the binlog producer is actively
traversing, causing "empty chunk returned from ChunkStore" panics.

This adds a GCPausableListener interface that the binlog producer implements.
Before GC runs, it calls Stop() which acquires a write lock, blocking until
all in-flight WorkingRootUpdated calls (which hold read locks) complete. After
GC finishes, Resume() releases the write lock, allowing new binlog events to
be generated.

Changes:
- Add GCPausableListener interface to doltdb package
- Implement Stop()/Resume() on binlogProducer using sync.RWMutex
- Add RLock/RUnlock around WorkingRootUpdated's diff traversal
- Pause GCPausableListeners in RunDoltGC before running GC
- Comprehensive test suite for Stop/Resume/drain semantics

Empty strings in TEXT/BLOB columns are stored with a zero hash address
(no out-of-band chunk). The binlog serializer unconditionally called
ReadBytes on this address, causing "empty chunk returned from ChunkStore"
panic. Skip the ChunkStore lookup when the address is the zero hash and
encode zero-length data instead.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@reltuk
Contributor

reltuk commented Mar 2, 2026

Hi @scbrown. Thanks for the report!

Can you provide an example server configuration where you saw the panic?

I do not believe the approach in this PR is the approach we would want to take. But looking at the code, it does seem possible that the retained roots tracking and GC safepoints on the binlogreplication diff path have been misrepresented in some way. Being able to test more deeply and dig into the root cause would be helpful.

@scbrown
Author

scbrown commented Mar 2, 2026

Hi @reltuk, thanks for the quick response!

Server configuration

config.yaml:

```yaml
listener:
  host: 0.0.0.0
  port: 3306
  max_connections: 500

remotesapi:
  port: 50051
```

.dolt/config.json (persisted system variables):

```json
{
  "sqlserver.global.log_bin": "1",
  "sqlserver.global.gtid_mode": "ON",
  "sqlserver.global.enforce_gtid_consistency": "ON",
  "sqlserver.global.binlog_row_metadata": "FULL",
  "sqlserver.global.dolt_transaction_commit": "1"
}
```

Running Dolt v1.83.0 on Linux (Ubuntu 22.04, go1.23.4).

What we actually hit

To be transparent: the panic we reproduced in production was not from a GC race. It was the zero-hash issue described in the companion PR #10603.

Our production table has ~11K rows with 7 TEXT columns, many containing empty strings (''). Empty strings store a zero hash as the blob address in the prolly tuple. encodeBytesFromAddress passes this zero hash to ns.ReadBytes(), which calls ChunkStore.Get() with hash zero → EmptyChunk → assertTrue panic.

We filed this GC PR separately because we noticed WorkingRootUpdated traverses prolly trees via DiffMaps and we weren't sure if this path participates in the existing GC safepoint infrastructure. It seemed worth flagging given the similarity to #9001/#9077 (stats path).

Custom binary testing

We built a custom binary from our fork with both fixes applied and tested on our production server:

  1. Phase 1 (defensive): Converted the assertTrue panic in nodeStore.Read to a returned error + wrapped errors in encodeBytesFromAddress with address/type context. Result: server stays up, logs the error gracefully, commits succeed.

  2. Phase 2 (root cause): Added addr.IsEmpty() check in encodeBytesFromAddress — if the address is all zeros, serialize as a zero-length value with the correct length-prefix size for the blob type. Result: no errors at all, clean binlog serialization for empty, NULL, and populated TEXT columns.

Test matrix on production data (11K rows, 7 TEXT columns):

  • INSERT with non-empty TEXT + DOLT_COMMIT → PASS
  • INSERT with empty string TEXT ('') + DOLT_COMMIT → PASS
  • INSERT with NULL TEXT + DOLT_COMMIT → PASS
  • Mixed columns (some empty, some populated, some NULL) + DOLT_COMMIT → PASS
  • Multiple consecutive commits with binlog ON → Server stable, no panics or errors

On GC safepoints

Your point about retained roots tracking is well taken. If the binlog replication diff path already participates in GC safepoints through a mechanism we missed, this PR may be unnecessary. The zero-hash fix in #10603 is the one that addresses our actual crash. Happy to close this one if you'd prefer to handle the GC coordination through your existing infrastructure.

Let me know if you'd like any additional detail or testing on either PR!

@timsehn
Contributor

timsehn commented Mar 3, 2026

Closing per discussion. We have a fix for the real bug you found.

@timsehn timsehn closed this Mar 3, 2026


Successfully merging this pull request may close these issues.

BUG: Binlog producer does not participate in GC safepoints — concurrent DOLT_GC can delete chunks mid-traversal
