Skip to content

core/state/snapshot: snapshot generation shutdown race condition#33540

Open
JonathanOppenheimer wants to merge 7 commits intoethereum:masterfrom
JonathanOppenheimer:JonathanOppenheimer/fix-snapshot-stop-generation
Open

core/state/snapshot: snapshot generation shutdown race condition#33540
JonathanOppenheimer wants to merge 7 commits intoethereum:masterfrom
JonathanOppenheimer:JonathanOppenheimer/fix-snapshot-stop-generation

Conversation

@JonathanOppenheimer
Copy link
Copy Markdown

@JonathanOppenheimer JonathanOppenheimer commented Jan 6, 2026

Overview

This PR fixes a race condition during blockchain shutdown where snapshot generation could continue accessing the trie database after it has been closed, leading to iterator errors. We noticed this in one of our nodes on https://github.com/ava-labs/avalanchego, which relies on an older version of geth with the same issue (so this behavior does happen!).

During node shutdown, the following sequence occurs:

  1. BlockChain.Stop() calls snaps.Release() to clean up snapshot resources
  2. Release() only resets the cache but doesn't stop the generator goroutine
  3. The trie database is then closed via triedb.Close()
  4. The still-running generator attempts to iterate storage tries
  5. Iterator fails because the database is closed ("Generator failed to iterate storage trie")

Problem

There are three related bugs:

  1. Release() doesn't stop generation: The diskLayer.Release() method only resets the cache without stopping ongoing snapshot generation, leaving the generator goroutine running after database closure.
  2. stopGeneration() has an incorrect completion check: The stopGeneration() method checks genMarker != nil to determine if generation is running. However, genMarker is set to nil when generation completes successfully, even though the generator goroutine is still waiting for the abort signal at the end of generate(). See line 705 in generate.go:
    dl.lock.Lock()
    dl.genMarker = nil
    close(dl.genPending)
    dl.lock.Unlock()
    // Someone will be looking for us, wait it out
    abort = <-dl.genAbort
    abort <- nil
    }
    This means stopGeneration() returns early without sending the abort signal.
  3. Node shutdown doesn't stop generation: During shutdown, no code path calls stopGeneration() or sends the abort signal to the generator, causing the generator to access a closed database and error.

Fix

  • Modified diskLayer.Release() to call stopGeneration() before releasing resources
  • Added cancelation architecture, removing reliance on someone having to wait
  • Fixed stopGeneration() to properly and safely stop snapshot generation
  • Added TestReleaseStopsGeneration to verify the fix and prevent regression. The test fails without the fix and passes with it.
    • Test creates a snapshot with active generation, waits for completion, then calls Release()
    • Without the fix, the test hangs indefinitely (generator waiting for abort signal that never comes)
    • With the fix, the test completes in <3 seconds

Note that this fix follows the same pattern used in Tree.Disable() in #30040, which introduced stopGeneration() for use in Disable() and Rebuild() but didn't address the shutdown path.

@JonathanOppenheimer JonathanOppenheimer changed the title snapshot: snapshot generation shutdown race condition core/state/snapshot: snapshot generation shutdown race condition Jan 6, 2026
@JonathanOppenheimer JonathanOppenheimer force-pushed the JonathanOppenheimer/fix-snapshot-stop-generation branch from 263f75c to 1e675e3 Compare January 6, 2026 20:28
@JonathanOppenheimer JonathanOppenheimer changed the title core/state/snapshot: snapshot generation shutdown race condition core/state/snapshot: snapshot generation shutdown race condition Jan 6, 2026
@JonathanOppenheimer JonathanOppenheimer changed the title core/state/snapshot: snapshot generation shutdown race condition core/state/snapshot: snapshot generation shutdown race condition Jan 6, 2026
Co-authored-by: Tsvetan Dimitrov tsvetan.dimitrov@avalabs.org
@JonathanOppenheimer JonathanOppenheimer marked this pull request as ready for review January 12, 2026 17:07
@fjl fjl removed the status:triage label Jan 13, 2026
Comment thread core/state/snapshot/generate.go Outdated

abort = <-dl.genAbort
abort <- stats
dl.genStats = stats
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is this being read from? Kinda suspicious, looks like it might need to be in a lock

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No this doesn't need a lock -- it's only read after stopGeneration() has terminated -- snapshot.go:520-631

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants