Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Finalize deposit before applying it to a state #3689

Closed
wants to merge 3 commits into from

Conversation

mkalinin
Copy link
Collaborator

@mkalinin mkalinin commented Apr 17, 2024

Proposal

Extends state.pending_balance_deposits to keep the deposit data and an epoch when a deposit was processed by the consensus layer. The deposit is applied to the state only when that epoch gets finalized and if the activation churn has enough capacity.

Analysis

  • Reduces engineering complexity of EIP-6110 implementation as an index of a new validator is associated with the pubkey only after the deposit is finalized, so there is no need to support forks in (pubkey, index) cache implementation.
  • Increases beacon state size as each entry of the deposits queue needs to hold 176 bytes of additional data (pubkey, withdrawal credentials and signature) which is a 12 times increase. For example, with 100_000 validators in the activation queue the pending deposits queue will consume ~18MB vs 1.5MB of just pending deposit balances.
  • Rate limits signature verifications per epoch by the means of the activation queue, but moves all verification operations to the epoch processing.
  • Requires top-ups to be finalized before they can be processed.
  • Invalid deposits will waste churn. Considering a high cost of submitting an invalid deposit, the potential effect should be negligible.
  • Makes no use for validator.activation_aligibility_epoch which can be repurposed in the future.

@dapplion
Copy link
Collaborator

The implementation complexity of the fork-aware pubkey cache has not been as bad as I originally imagined in Lodestar and Lighthouse. Would like to know from other teams too.

Copy link
Contributor

@fradamt fradamt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since deposits are not applied until after finalization, we don't need activation_eligibility_epoch anymore, as its role is replaced by deposit.epoch in a PendingDeposit. Then we can remove is_eligible_for_activation_queue from process_registry_updates, and move its balance check to is_eligible_for_activation:

def is_eligible_for_activation(state: BeaconState, validator: Validator) -> bool:
    """
    Check if ``validator`` is eligible for activation.
    """
    return (
        # Has not yet been activated
        and validator.activation_epoch == FAR_FUTURE_EPOCH
        # Has sufficient effective balance
        and validator.effective_balance >= MIN_ACTIVATION_BALANCE
    )

specs/electra/beacon-chain.md Show resolved Hide resolved
@@ -1271,12 +1258,14 @@ def process_deposit_receipt(state: BeaconState, deposit_receipt: DepositReceipt)
if state.deposit_receipts_start_index == UNSET_DEPOSIT_RECEIPTS_START_INDEX:
state.deposit_receipts_start_index = deposit_receipt.index

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about checking is_valid_deposit_signature here? Unless we specifically want to avoid burning a top-up with incorrect signature

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two things why do signature check as is:

  1. Some infra might already have top-ups sent with zero signatures (worth checking tho)
  2. Rate limiting a number of deposit signature checks per epoch is really nice, it alleviates the risk of delaying block processing if someone manages to pack a thousand deposits into it

@StefanBratanov
Copy link
Contributor

The implementation complexity of the fork-aware pubkey cache has not been as bad as I originally imagined in Lodestar and Lighthouse. Would like to know from other teams too.

In Teku, we implemented it in a way, where we cache only the finalized indices and for the rest we do an N loop starting from the last finalized index + 1. It could be a problem for chains which are not finalizing for a long time, but otherwise I don't see a big issue with it, but would prefer if this PR makes it into the specs.

@dapplion
Copy link
Collaborator

An important consideration w.r.t. the inactivity leak is that this PR prevents top-ups during non-finality. This affects both the ability of leaking validators to maintain their stake share, and non-leaking validators from increasing their share.

@mkalinin
Copy link
Collaborator Author

An important consideration w.r.t. the inactivity leak is that this PR prevents top-ups during non-finality. This affects both the ability of leaking validators to maintain their stake share, and non-leaking validators from increasing their share.

This is an important consideration. Do you think there is any risk related to that?

Honest validators’ stake should remain stable as they keep voting — this is at least a desired property, but i don’t know if in practice this property holds in all cases as there might be not enough block space to accommodate all submitted attestations on chain thus honest validators may leak too. Preventing dishonest validators that aren’t voting from increasing their stake share is not strongly desired but a nice property, imo.

@philknows philknows mentioned this pull request May 2, 2024
12 tasks
Copy link
Contributor

@etan-status etan-status left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the introduction of the queue, as it makes sure that BLS signature verifications remain bounded (16 / block). This way, the CL hardware requirements don't become a concern on every future gas limit increase.

Finality also simplifies implementations, because it means that validator indices to pubkey mappings are the same across all valid branches. As new validator activations can only happen while finalizing, this shouldn't affect the underlying PoS security.

The one thing that will no longer be possible is topping up a validator while the network is not finalizing / possibly leaking. I don't personally have concerns with that, but it should be documented at least as a semantic change that affects non-finalizing networks.

@@ -155,7 +153,7 @@ The following values are (non-configurable) constants used throughout the specif

| Name | Value | Unit |
| - | - | :-: |
| `PENDING_BALANCE_DEPOSITS_LIMIT` | `uint64(2**27)` (= 134,217,728) | pending balance deposits |
| `PENDING_DEPOSITS_LIMIT` | `uint64(2**27)` (= 134,217,728) | pending deposits |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2^32 would be the theoretical limit with infinite gas fee and a single block that fills the entire deposit tree

if processed_amount + deposit.amount > available_for_processing:
break
increase_balance(state, deposit.index, deposit.amount)
if deposit.epoch >= state.finalized_checkpoint.epoch:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should also check for the legacy style deposits import to have caught up with the queue.

Otherwise, deposits from the old style (via eth1_data) and from the new queue can get reordered, possibly messing with decentralized staking pools that rely on a linear deposit order.

@terencechain
Copy link
Contributor

The implementation complexity of the fork-aware pubkey cache has not been as bad as I originally imagined in Lodestar and Lighthouse. Would like to know from other teams too.

Same with Prysm. We implemented it the same way as Teku, as mentioned by @StefanBratanov. I'm indifferent to this, but if the sole purpose is to remove the fork-aware pubkey cache, I don't think it's necessary.

@mkalinin
Copy link
Collaborator Author

Outcomes of the breakout session on the change proposed by this PR:

  • Queueing deposits is important to rate limit sig verification operations. There is a desire not only to limit a number of deposits processed per epoch by the churn, but also by a number (the proposed number is 16). The reason for that is that there can be legit blocks with hundreds of deposits in them (e.g. staking pools doing batch depositing) and we want to ensure that processing of such blocks won't be delayed.
  • We want to benchmark processing a block with ~1000 deposits in it to understand how big of the problem it could be.
  • Finalizing (pubkey, index) might not only matter for client implementations but also for tooling and infrastructure like block explorers. Though, having this cache fork-independent became a weak argument since most of the teams implemented it, it still can prevent potential bugs in clients and reduces complexity.
  • To alleviate hash_tree_root complexity of potentially large and slicing queue, we want to explore two queue solution or queue with two pointers solution. The key intuition behind either of such solutions is that any queue is append only and reset to empty, no slicing is possible.
  • Deposits should be processed in strict order around the transition to prevent frontrunning. Potential solution is to create new validators induced by Eth1Data poll deposits instantly but add synthesized deposits with the entire balance of that validators to the end of the queue so the balance will hit the churn; and prevent new style deposits from being processed until the Eth1Data poll gap is filled.
  • We agreed that we do accept data complexity (potential state growth) and the fact that no top-ups is possible during period of non-finality. The latter even seems a positive thing.

@pawanjay176
Copy link
Contributor

pawanjay176 commented May 17, 2024

Benchmarked a block with 1000 deposits using lcli.
Attaching the block and pre-state if anyone wants to run it on their clients.
big-deposit-test.zip

[2024-05-17T11:59:49Z INFO  lcli::transition_blocks] Run 0: 841.606125ms                                                                                                                                              
[2024-05-17T11:59:49Z DEBUG lcli::transition_blocks] Build caches: 1.894041ms                                                                                                                                         
[2024-05-17T11:59:49Z DEBUG lcli::transition_blocks] Initial tree hash: 225.958µs                                                                                                                                     
[2024-05-17T11:59:49Z DEBUG lcli::transition_blocks] Slot processing: 5.209µs                                                                                                                                         
[2024-05-17T11:59:49Z DEBUG lcli::transition_blocks] Build all caches (again): 875ns                                                                                                                                  
[2024-05-17T11:59:49Z DEBUG lcli::transition_blocks] Batch verify block signatures: 4.483875ms                                                                                                                        
[2024-05-17T11:59:50Z DEBUG lcli::transition_blocks] Process block: 838.27275ms                                                                                                                                       
[2024-05-17T11:59:50Z DEBUG lcli::transition_blocks] Post-block tree hash: 1.896833ms

(Note: the Batch verify block signatures time doesn't include deposit signatures)
The block was generated on a local kurtosis network by making bulk deposits.

Even on this relatively small network, the block arrived quite late on the node and took quite a bit to process.

May 17 11:56:13.776 DEBG Delayed head block                      set_as_head_time_ms: 2, imported_time_ms: 92, attestable_delay_ms: 6756, available_delay_ms: 6682, execution_time_ms: 3179, blob_delay_ms: unknown, observed_delay_ms: 3502, total_delay_ms: 6776, slot: 36, proposer_index: 125, block_root: 0x20ab5155dff25ded6c704d97028297a8d33af4b171c93b45c1bd7e8df3fec126, service: beacon

The execution layer processing itself took ~3secs. Consensus block processing is interleaved with the execution processing in lighthouse, so overall it took total_delay - observed_delay ~ 3.2s for processing the block.

@mkalinin
Copy link
Collaborator Author

@pawanjay176 thanks a lot for this benchmark!

Even on this relatively small network, the block arrived quite late on the node

I am curious why is that happened? It’s hard to believe that the propagation time took that many especially considering a small network. My guess would be that it wasn’t sent to the wire before fully processed by the proposer’s node, wdyt?

@pawanjay176
Copy link
Contributor

Sorry took a while, was on vacation.

My guess would be that it wasn’t sent to the wire before fully processed by the proposer’s node, wdyt?

Yeah I think that is right, the propagation seems near instant but producing the big block, sending it to the VC across the http api, getting it back and then publishing it finally took longer than for a smaller block.

"""
return (
validator.activation_eligibility_epoch == FAR_FUTURE_EPOCH
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct to assume that validators can never have an activation_eligibility_epoch > ELECTRA_FORK_EPOCH? e.g., by depositing during the last slot before electra activates.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those validators are added to the pending deposits queue in upgrade_to_electra

epoch=get_current_epoch(state),
pubkey=validator.pubkey,
withdrawal_credentials=validator.withdrawal_credentials,
amount=balance,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

excess_balance? Because MIN_ACTIVATION_BALANCE was already credited.

pubkey=validator.pubkey,
withdrawal_credentials=validator.withdrawal_credentials,
amount=balance,
signature=BLSSignature(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For SyncAggregate, G2_POINT_AT_INFINITY is used instead of a zero value when noone signed.

Or, None with EIP-7688 optionals (but I guess zero value or infinity is also fine, as it is not checked anyway).

PendingBalanceDeposit(index=index, amount=excess_balance)
state.pending_deposits.append(
PendingDeposit(
epoch=get_current_epoch(state),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there an off-by-one here?

When epoch N finalizes, everything up through and including compute_start_slot_at_epoch(N) is finalized.

  1. If activations are processed for deposit.epoch <= finalized_checkpoint.epoch, then that would also activate deposits on the non-finalized slots within N (only the initial slot of N is finalized).
  2. If activations are processed for deposit.epoch < finalized_checkpoint.epoch, then deposits within the very first slot of N are not processed for another ~6 minutes despite already being finalized.

Currently, it seems we are following the second strategy (deposit.epoch >= state.finalized_checkpoint.epoch). Probably fine to have the extra ~6 minutes wait for deposits within the first slot, but could be avoided, so still keeping the comment for awareness.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally in favour of strategy (2) because of its simplicity

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about this and I changed my mind. The issue is that clients may (and probably will) filter deposits by blocks or db invariants like "finalized", the presence of this off by one creates edge cases that need to be always checked and tested. While it's not likely to create a consensus bug I feel it will be a source of bugs nonetheless

Copy link
Collaborator Author

@mkalinin mkalinin Jun 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we set deposit.epoch in the queue to get_current_epoch(state) + 1, will that prevent issues that you have in mind? It can also be renamed to deposit.activation_epoch to better reflect the semantics

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess storing slot and processing deposits up through deposit_receipt.slot <= compute_start_slot_at_epoch(state.finalized_checkpoint.epoch) is the most straight forward.

PendingDeposit(
epoch=get_current_epoch(state),
pubkey=validator.pubkey,
withdrawal_credentials=validator.withdrawal_credentials,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are withdrawal_credentials necessary to be saved here?

  • If yes, there could be a race where a BLSToExecutionChange gets processed while the PendingDeposit is in the queue, changing the withdrawal credentials of the validator while the PendingDeposit retains the stale entry.
  • If no, we could just set this to a zero value (or to None with EIP-7688 optionals).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It isn’t necessary so I think that setting it to zero value should be fine

@mkalinin
Copy link
Collaborator Author

Closing in favour of #3818

1 similar comment
@mkalinin
Copy link
Collaborator Author

Closing in favour of #3818

@mkalinin mkalinin closed this Jun 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants