BEEFY validators skip finality proof when they are all closed at the same time #2842

Closed
serban300 opened this issue Jan 3, 2024 · 6 comments
Assignees
Labels
I2-bug The node fails to follow expected behavior.

Comments

@serban300
Contributor

The root cause of #2285 was that no validator had a BEEFY finality proof for block #7829924. The validators simply skipped the justification for #7829924 and moved on. More details in this comment: #2285 (comment)

I also stumbled upon this issue while working on #2787.

Prerequisites:

  1. A zombienet config:
[relaychain]
default_image = "docker.io/parity/polkadot:latest"
default_command = "/home/serban/workplace/sources/polkadot-sdk/target/release/polkadot"

# refer to ./README.md for more details on how to create snapshot and spec
chain = "rococo-local"
chain_spec_path = "/home/serban/workplace/snapshots/db-test-gen/rococo-local.json"

  [[relaychain.nodes]]
  name = "alice"
  validator = true

  [[relaychain.nodes]]
  name = "bob"
  validator = true
  2. Adjust rococo-local epoch duration from:
	frame_support::parameter_types! {
		pub EpochDurationInBlocks: BlockNumber =
			prod_or_fast!(1 * HOURS, 1 * MINUTES, "ROCOCO_EPOCH_DURATION");
	}

to

	frame_support::parameter_types! {
		pub EpochDurationInBlocks: BlockNumber =
			prod_or_fast!(1 * HOURS, 1 * MINUTES / 6, "ROCOCO_EPOCH_DURATION");
	}
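
To see what the change above does to the epoch length, here is a minimal arithmetic sketch. The constants are assumptions that are not stated in this issue (6-second blocks, so `MINUTES` is 10 blocks), and `prod_or_fast` below is a simplified, hypothetical stand-in for the real `prod_or_fast!` macro. Check the actual values in your checkout.

```rust
type BlockNumber = u32;

// Assumed values, mirroring the usual relay chain time constants.
// They are NOT taken from this issue; verify them against your checkout.
const MILLISECS_PER_BLOCK: u64 = 6000;
const MINUTES: BlockNumber = 60_000 / (MILLISECS_PER_BLOCK as BlockNumber);
const HOURS: BlockNumber = MINUTES * 60;

// Simplified stand-in for `prod_or_fast!`: returns the fast value when
// `fast_runtime` is set, otherwise the production value.
fn prod_or_fast(prod: BlockNumber, fast: BlockNumber, fast_runtime: bool) -> BlockNumber {
    if fast_runtime {
        fast
    } else {
        prod
    }
}

// Epoch duration in blocks before the adjustment.
fn epoch_before(fast_runtime: bool) -> BlockNumber {
    prod_or_fast(1 * HOURS, 1 * MINUTES, fast_runtime)
}

// Epoch duration in blocks after the adjustment (integer division).
fn epoch_after(fast_runtime: bool) -> BlockNumber {
    prod_or_fast(1 * HOURS, 1 * MINUTES / 6, fast_runtime)
}
```

Under these assumptions the fast-runtime epoch shrinks from 10 blocks to 1 block, because `10 / 6` truncates to `1`. The production value is unchanged.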

Reproduction steps:

  1. Start the zombienet. Let it run for ~60s.
  2. Close the zombienet. I think it needs to be closed right when a finality proof is generated, but it can also be closed at a random moment.
  3. Start the zombienet again and also start a 3rd node separately:
./target/release/polkadot --chain=/home/serban/workplace/snapshots/db-test-gen/rococo-local.json --dave --tmp --enable-offchain-indexing=true --sync warp --reserved-nodes "/ip4/0.0.0.0/tcp/30310/p2p/12D3KooWQCkBm1BYtkHpocxCwMgR8yjitEeHGx8spzcDLGt2gkBm"
  4. Check the logs of the 3rd node. See if it prints something like "🥩 ran out of peers to request justif #43 from". If not, repeat steps 2-3.
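
The log check in the last step can be automated. The following is a minimal sketch that relies only on the message text quoted in this issue; the function names are hypothetical, and the exact log format may differ between node versions.

```rust
use std::collections::BTreeMap;

// Message fragment taken from the log line quoted in this issue.
const MARKER: &str = "ran out of peers to request justif #";

// Returns the block number when `line` contains the BEEFY
// "ran out of peers" message, and `None` otherwise.
fn stuck_justification_block(line: &str) -> Option<u64> {
    let start = line.find(MARKER)? + MARKER.len();
    let digits: String = line[start..]
        .chars()
        .take_while(|c| c.is_ascii_digit())
        .collect();
    digits.parse().ok()
}

// Counts how often each stuck block number shows up in the given log text.
fn stuck_blocks(log: &str) -> BTreeMap<u64, usize> {
    let mut counts = BTreeMap::new();
    for block in log.lines().filter_map(stuck_justification_block) {
        *counts.entry(block).or_insert(0) += 1;
    }
    counts
}
```

Feeding the 3rd node's log into `stuck_blocks` shows whether the message keeps repeating for the same block number, which is the symptom described here.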

The problem seems related to closing all validators at the same time. Maybe saving the AUX DB on drop, as suggested in #2378 (comment), would fix this.

@serban300 serban300 added the I2-bug The node fails to follow expected behavior. label Jan 3, 2024
@serban300 serban300 self-assigned this Jan 3, 2024
@acatangiu
Contributor

> Maybe saving the AUX DB on drop as suggested #2378 (comment) would fix this.

drop() handlers are not run when the node client crashes or is force-killed (which I think are the two main reasons a client ever stops), so it won't actually help here IMO.
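
The point about drop() can be illustrated with a minimal sketch. Everything here is hypothetical: `SaveOnDrop` is not the real BEEFY worker, the "disk" is an in-memory cell, and `std::mem::forget` is only a stand-in for a crash or force kill, where destructors likewise never run.

```rust
use std::cell::RefCell;
use std::rc::Rc;

// Stand-in for the on-disk aux DB: it outlives the simulated node.
type Disk = Rc<RefCell<Option<u64>>>;

// Hypothetical voter that persists its latest BEEFY block only when dropped.
struct SaveOnDrop {
    disk: Disk,
    best_beefy: u64,
}

impl Drop for SaveOnDrop {
    fn drop(&mut self) {
        *self.disk.borrow_mut() = Some(self.best_beefy);
    }
}

// Runs a voter up to block 43, then stops it gracefully or "kills" it.
// Returns what ended up on the simulated disk.
fn run_save_on_drop(killed: bool) -> Option<u64> {
    let disk: Disk = Rc::new(RefCell::new(None));
    let mut voter = SaveOnDrop { disk: Rc::clone(&disk), best_beefy: 0 };
    voter.best_beefy = 43;
    if killed {
        // Stand-in for a crash or SIGKILL: the destructor never runs.
        std::mem::forget(voter);
    } else {
        drop(voter);
    }
    let persisted = *disk.borrow();
    persisted
}

// Same scenario, but the state is written through at update time.
fn run_write_through(killed: bool) -> Option<u64> {
    let disk: Disk = Rc::new(RefCell::new(None));
    *disk.borrow_mut() = Some(43); // persisted immediately
    let _ = killed; // how the process ends no longer matters
    let persisted = *disk.borrow();
    persisted
}
```

In this sketch the save-on-drop variant loses block 43 whenever the voter is "killed", while the write-through variant keeps it regardless of how the process ends.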

> The problem seems related to closing all validators at the same time.

On decentralized networks this should never be the case, so I think this issue is low priority.

@stakeworld
Contributor

@serban300 On my Westend RPC nodes I get a lot of "🥩 ran out of peers to request justif #19153161 from". Always the same block number. I suppose this is something similar?

@serban300
Contributor Author

Thanks for the report! Yes, it's probably related to #3003.

Published a fix which should prevent this from happening in the future: #3074

But it won't fix the missing BEEFY proofs on older blocks. Will have to investigate how to handle this.

@stakeworld
Contributor

> But it won't fix the missing BEEFY proofs on older blocks. Will have to investigate how to handle this.

Yes, it's a little spammy because the error comes at every block, so around 10 per minute. I'm now filtering them out, but that is of course not ideal.

github-merge-queue bot pushed a commit that referenced this issue Feb 1, 2024
@serban300
Contributor Author

Merged the fix into master. Waiting for it to be deployed on Westend, and after that maybe we should restart BEEFY on Westend in order to fix the warning.

@serban300
Contributor Author

bgallois pushed a commit to duniter/duniter-polkadot-sdk that referenced this issue Mar 25, 2024