Skip to content

[POC] heal the worldstate#4972

Closed
matkt wants to merge 24 commits into
besu-eth:mainfrom
matkt:feature/heal-worldstate
Closed

[POC] heal the worldstate#4972
matkt wants to merge 24 commits into
besu-eth:mainfrom
matkt:feature/heal-worldstate

Conversation

@matkt

@matkt matkt commented Jan 22, 2023

Copy link
Copy Markdown
Contributor

Signed-off-by: Karim TAAM karim.t2am@gmail.com

PR description

This PR add a heal mechanism fo the worldtstate in case of inconsistency (unable to trie node). To fix this, we start a quick heal of the worldstate automatically and once the fix is done we restart the block import.

After the detection of an invalid path

  • we delete this path to force the healing of this part.
  • then we delete the trielogs.
  • we select a pivot block (before of after the heal)
  • we move the blokchain to this pivot block (rewind or download the missing blocks)
  • then we launch a worldstate heal.

This feature can also heal a node that has been inconsistent for a long time, but it will take longer because there will be more nodes to heal. With this PR the healing will be done as soon as the problem is detected so there will not a lot to heal and it will be fast

Performed tests

  • Trigger multiple inconsistencies to fix (passed)
  • Fixed a node that has been inconsistent for a long time (passed)
  • Run a snapsync from scratch on goerli (passed)
  • Run a checkpoint sync from scratch on goelri (passed)
  • Run a checkpoint sync from scratch on main (in progress)
  • Run a validator teku+besu on goerli (passed)
  • Profile performance on this node to avoid perf regression (passed)

Fixed Issue(s)

#4379
#4785
#4768

Documentation

  • I thought about documentation and added the doc-change-required label to this PR if
    updates are required.

Changelog

@matkt matkt force-pushed the feature/heal-worldstate branch 2 times, most recently from f173297 to d440264 Compare January 22, 2023 10:53
Signed-off-by: Karim TAAM <karim.t2am@gmail.com>
@matkt matkt force-pushed the feature/heal-worldstate branch 4 times, most recently from 750c53b to fd31a2d Compare January 25, 2023 09:02
Signed-off-by: Karim TAAM <karim.t2am@gmail.com>
@matkt matkt force-pushed the feature/heal-worldstate branch 2 times, most recently from 6593ccd to 1f1f96b Compare January 30, 2023 18:59
Signed-off-by: Karim TAAM <karim.t2am@gmail.com>
@matkt matkt force-pushed the feature/heal-worldstate branch from 39f1ed8 to dead4f8 Compare January 31, 2023 09:22
@matkt matkt force-pushed the feature/heal-worldstate branch from 3103951 to 1311342 Compare February 1, 2023 13:56
matkt added 5 commits February 1, 2023 14:57
Signed-off-by: Karim TAAM <karim.t2am@gmail.com>
Signed-off-by: Karim TAAM <karim.t2am@gmail.com>
Signed-off-by: Karim TAAM <karim.t2am@gmail.com>
Signed-off-by: Karim TAAM <karim.t2am@gmail.com>
Signed-off-by: Karim TAAM <karim.t2am@gmail.com>
Signed-off-by: Matt Nelson <monels11@gmail.com>
matkt and others added 6 commits February 2, 2023 11:09
Signed-off-by: Karim TAAM <karim.t2am@gmail.com>
Signed-off-by: Karim TAAM <karim.t2am@gmail.com>
Signed-off-by: Karim TAAM <karim.t2am@gmail.com>
Signed-off-by: garyschulte <garyschulte@gmail.com>
@garyschulte garyschulte marked this pull request as ready for review February 3, 2023 20:54
Signed-off-by: Karim TAAM <karim.t2am@gmail.com>
@matkt matkt force-pushed the feature/heal-worldstate branch from 90bfed2 to 960a8bb Compare February 3, 2023 22:41
public boolean isInitialSyncPhaseDone() {
return isInitialSyncPhaseDone;
public boolean isResyncNeeded() {
return isResyncNeeded;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we never reset this value afaict. once we need a resync, we always need a resync ;)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed I switch this flag on markInitialSyncPhaseAsDone

!= BlockHeader.GENESIS_BLOCK_NUMBER) {
LOG.info(
"Checkpoint sync was requested, but cannot be enabled because the local blockchain is not empty.");
"Snap sync was requested, but cannot be enabled because the local blockchain is not empty.");

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: should be "checkpoint"

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done thanks

trieLogStorage,
fallbackNodeFinder);
final KeyValueStorage trieLogStorage) {
super(accountStorage, codeStorage, storageStorage, trieBranchStorage, trieLogStorage);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

static FinalBlockConfirmation ancestorConfirmation(final MutableBlockchain blockchain) {
return firstHeader -> blockchain.contains(firstHeader.getParentHash());
return firstHeader ->
blockchain.getChainHeadHeader().getHash().equals(firstHeader.getParentHash());

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might this break backward sync for a reorg. e.g. blockhain.contains() will return true if we have block A' but it isn't the chain head. This check requires it to be the chain head...

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm using the blck number. Should be fine now but. if you can verify

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when I pull this branch here I see:
return firstHeader -> blockchain.getChainHeadBlockNumber() + 1 >= firstHeader.getNumber();
I might be missing something, but I read the firstAncestor block as being the earliest block fetched by backward sync. and finalBlockConfirmation.ancestorHeaderReached as the termination condition for fetching further back in BackwardSyncAlgorithm.pickNextStep.

If we are backward syncing because of a fork from 5 or 10 blocks behind head, I think this condition could terminate early. it is almost logically identical to the conditional above it:
return firstHeader -> blockchain.getChainHeadBlockNumber() + 1 >= firstHeader.getNumber();

The first condition makes sense to me b/c we really just need to ensure this backward path resolves to something we have, rather than a particular height.

@matkt matkt Feb 6, 2023

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I think I understand your remark. On the other hand, I have to keep this check because it allows me to have the backardsync works after the heal in case we rewind. Because we will have the blocks but I still want to go back to import them again . So I think I should add the initial condition + this one. In the case of a REORG the first condition will make sure that it does not stop too soon

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@garyschulte garyschulte left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solid, some of the backward sync changes are concerning.

matkt and others added 3 commits February 4, 2023 10:23
Signed-off-by: Karim TAAM <karim.t2am@gmail.com>
Signed-off-by: Karim TAAM <karim.t2am@gmail.com>

@siladu siladu left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should ethereum/referencetests/src/reference-test/external-resources
be changing as part of this PR?

private final WorldStateArchive worldStateArchive;
private final ConsensusContext consensusContext;

private Optional<Synchronizer> synchronizer;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Think we should consider alternatives to bloating ProtocolContext, a global object.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, this is indeed a problem. I tried a lot to do otherwise but this was the cleanest I found with the current code structure

if (worldStateStorage instanceof BonsaiWorldStateKeyValueStorage) {
LOG.info("Clearing bonsai flat account db");
worldStateStorage.clearFlatDatabase();
worldStateStorage.clearTrieLog();

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need to do this?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we will not be able to go back before this pivot block. In addition, we could have bad trielog so it is preferable to heal them too. Finally keeping old trielog could make a false positive for the isWorldStateAvailable.

protocolContext.getBlockchain().getChainHeadHash().equals(pivotBlockHeader.getHash());
if (!isValidChainHead) {
if (protocolContext.getBlockchain().contains(pivotBlockHeader.getHash())) {
protocolContext.getBlockchain().rewindToBlock(pivotBlockHeader.getHash());

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could the pivot block be > the max allowable bonsai trie layers away? If so, could this rewind take an unfeasible amount of time or maybe error? If so, then what state is the user left in and is it recoverable?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this part rewind only the blockchain part not the worldstate part.
If the pivot block is before the current head, we do a rewind otherwise we do nothing. In any case the worldstate will be healed with the new pivot block without doing a rollback or rollforward because we cannot do it if it is corrupted

it is important to choose a new pivot block and not to heal the current head. Because we are not sure that block will remain on the canonical chain

fastSyncStateStorage.loadState(ScheduleBasedBlockHeaderFunctions.create(protocolSchedule));

if (isResync) {
worldStateStorage.clear();

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why no need to clear the worldStateStorage now?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For a resync there is already a clear in the Worldstate downloader part https://github.com/hyperledger/besu/pull/4972/files#diff-28d8c7a62dddded52099a727bf58aa864d58b055795b95ebfeee6ac8e2a8cea2R173
So it's clearly useless to do it again @garyschulte

@matkt

matkt commented Feb 5, 2023

Copy link
Copy Markdown
Contributor Author

Should ethereum/referencetests/src/reference-test/external-resources be changing as part of this PR?

yes good catch I fixed that

Signed-off-by: Karim TAAM <karim.t2am@gmail.com>
@matkt

matkt commented Feb 6, 2023

Copy link
Copy Markdown
Contributor Author

Because of some release difficulties with this PR , I created another one @garyschulte @siladu -> #5059

@matkt matkt closed this Feb 6, 2023
@matkt matkt mentioned this pull request Feb 6, 2023
2 tasks
@matkt matkt deleted the feature/heal-worldstate branch July 5, 2023 07:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants