Bedrock: updated Batch Derivation specification#2807

Closed
protolambda wants to merge 4 commits into `develop` from `batch-deriv-v2-spec`
Conversation

@protolambda
Contributor

@protolambda protolambda commented Jun 17, 2022

This spec update describes the updated batch derivation process, as well as batch-submission (previously missing from the spec!), and approaches derivation more systematically (in stages).

This is largely based on two docs:

  • The latest design proposal in notion (stable for the last week at least)
  • The existing rollup_node.md contents: split off from the node spec. The rollup node is more than just derivation, and derivation is a topic that deserves its own document.

Update:

  • Added diagrams

There are some imperfections, and I need to update ToCs + linting. This is a draft.

See #2714 for ongoing implementation work.

Key bedrock improvements:

  • Data of L2 blocks can be spread across multiple L1 data transactions (fixes several sub-issues)
  • Stability/testing design improvements: isolated stages instead of monolithic driver state
  • Lower latency to L2 head via L1 syncing, with eager batch derivation
  • Improved reorg handling: the derivation pipeline can best determine when to redo derivation based on L1 reorgs. Previously the L1 head changes could force a sudden reorg, even if the L2 engine was not synced up to the reorged L1 blocks yet.
  • Fix error handling in the interactions with the execution-engine: an RPC error is not the same as an invalid transaction, payload attributes should never accidentally be dropped.
  • Improve robustness of derivation: smaller steps and buffering, any small RPC failure does not result in cascading effects.

Known bugs/issues in spec:

  • Eager batch derivation is not fully specced yet.
    • @trianglesphere is working on this in notion
    • I ported over the existing spec of full-sequencing window batch derivation, but split the batch-logic from the payload-attributes-derivation logic.
  • channels should have a timestamp (time constrained by inclusion of frame data) as part of their uniqueness: otherwise channels may revive, or sync may not be consistent when not starting at genesis, even after waiting for a full timeout period
    • discussed with Josh, should be a simple fix
  • The batch submission minSize API argument is not specced/implemented yet.
    • maybe best to ignore until we test more of the derivation updates, it's not critical
  • Meta:
    • Broken links need to be checked/fixed
    • Table of Contents re-generation
    • Markdown linting fixes (ported a lot over from notion, it's not all valid with the github linter)

@changeset-bot

changeset-bot bot commented Jun 17, 2022

⚠️ No Changeset found

Latest commit: b94c634

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types


@protolambda protolambda marked this pull request as ready for review June 22, 2022 12:04
Contributor

@norswap norswap left a comment


Huge review =)

First off, I've talked with Proto and given the amount of changes I'd like, I'll take over editing the spec to free up Proto on other things. He (and others) can then review the new spec PR.

I do have a few broad design & naming suggestions (these suggestions are also in review comments, but since there are so many of them, I'm hoisting them here for visibility; sorry for the duplication):

Naming: I don't love the current terminology of channels & batches.
- channel generally implies a communication conduit — here it's a big blob of data encoding multiple blocks
- batch normally implies multiplicity, but here a batch is a (simple) encoding of block inputs
- suggestion: rename "channel" to "batch" and "batch" to some paraphrase like "encoded block inputs" (it's not being used very often)

Architecture: it seems like a lot of the work could be done in the batcher instead of in the rollup node, and I think this mitigates some potential issues, see my comment at the end of batching.md for details.

I'm also curious (maybe dubious?) about the necessity of many concurrent channels. It seems like in most cases we're only using one at a time. I think moving things to the batcher also helps mitigate this particular architectural overhead (the great thing about the design is that the on-chain data doesn't need to change, so we can potentially table this change for later if it is something we would like to do).

# Batch submitting

The process that signs and publishes transactions is not part of the rollup node, but an external “batcher”.
This is effectively the inverse function of the derivation process, but specified here for posterity. This API design is not enforced by the protocol.

It's probably useful to mention that the batch format is enforced, and link to the page of definition.

The process that signs and publishes transactions is not part of the rollup node, but an external “batcher”.
This is effectively the inverse function of the derivation process, but specified here for posterity. This API design is not enforced by the protocol.

The batcher may use any data-availability platform that registers batch data within a chain of block-hashes and block-numbers.

Do you mean "a blockchain such that every block has a number, and every block includes the hash of its predecessor block"?

The batcher may use any data-availability platform that registers batch data within a chain of block-hashes and block-numbers.

The rollup node is mostly agnostic to the data platform: as a verifier the channel-bank has to traverse the chain and data has to be written to the bank.
Ethereum calldata is supported by default, other data-sources may be supported in future upgrades.

You mention the channel bank here before explaining what it is, which should generally be avoided.

Suggestion:

The rollup node is mostly agnostic to the data platform, as long as it can be treated as a traditional blockchain with numbers and block hashes, as explained earlier. Ethereum calldata is supported by default, other data-sources may be supported in future upgrades.

The part about the channel bank can be moved to the part where the channel bank is explained.


[Block derivation][g-derivation] is the process of building a chain of L2 [blocks][g-block] from a chain of L1 data.

The L1 data can be thought of as a stream, and L2 rollup nodes are the ***readers***, transforming their state with the inputs.

suggestion:

The L1 chain can be seen as a stream of L2 derivation inputs, which L2 rollup nodes read from, using the derivation inputs to derive the L2 chain and update their state.

The L1 data can be thought of as a stream, and L2 rollup nodes are the ***readers***, transforming their state with the inputs.

The L2 batcher nodes are the ***writers*** to the stream: they take L2 changes and submit the necessary data to L1.
Others can then reproduce the same L2 changes.

Since batchers do not form an interconnected network, I would not call them "nodes". Use "batcher" instead?

In general, this might lack precision: what are changes?

It might also be good to make explicit the distinction between verifiers and sequencers.

suggestion:

L2 batchers write to the stream (to L1): they take L2 blocks created by a sequencer and submit the necessary data to L1, such that verifiers are able to derive the same blocks from L1.

Decode `BatchData` entries, each entry is prefixed with a version byte.
Unknown versions are invalid just like malformatted `BatchData`, and the pipeline should move to the next frame.
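The version-gated decoding described here can be sketched as follows. This is a minimal, hypothetical illustration: `decode_batch_v1` stands in for the real RLP decoder, and the function names are not from the spec.

```python
def decode_batch_v1(payload: bytes) -> bytes:
    # Stand-in for the real decoder, which RLP-decodes
    # [epoch, timestamp, transaction_list] from the payload.
    if not payload:
        raise ValueError("empty payload")
    return payload

def decode_batch_data(data: bytes):
    """Return the decoded batch, or None if the entry is invalid
    (unknown version byte or malformed payload); on None the
    pipeline simply moves on to the next frame."""
    if not data:
        return None
    version, payload = data[0], data[1:]
    if version != 0:  # unknown versions are invalid, like malformed BatchData
        return None
    try:
        return decode_batch_v1(payload)
    except ValueError:
        return None
```

The key point is that an unknown version byte is treated the same as a decoding failure: the entry is skipped rather than halting the pipeline.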

#### Batch encoding

I see, so we do have a "batch", which is basically an encoded block. This invalidates some comments I made earlier.

Also not fond of the name, though I know we've used it like that in the past — batch implies a multiplicity of things.

and encoded as a byte-string (including version prefix byte) in the bundle RLP list.

`<batch_version>`: `<name> = <content>`
- `0`: `BatchV1 = RLP([epoch, timestamp, transaction_list])`
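A sketch of producing a version-0 batch, with a minimal pure-Python RLP encoder (sufficient for byte-strings and lists). The integer-to-bytes helper and function names are illustrative, not from the spec.

```python
def _len_prefix(length: int, offset: int) -> bytes:
    # RLP length prefix: short form below 56 bytes, long form otherwise.
    if length < 56:
        return bytes([offset + length])
    size = length.to_bytes((length.bit_length() + 7) // 8, "big")
    return bytes([offset + 55 + len(size)]) + size

def rlp_encode(item) -> bytes:
    # Minimal RLP encoder for bytes and (nested) lists of bytes.
    if isinstance(item, bytes):
        if len(item) == 1 and item[0] < 0x80:
            return item
        return _len_prefix(len(item), 0x80) + item
    payload = b"".join(rlp_encode(x) for x in item)
    return _len_prefix(len(payload), 0xC0) + payload

def encode_batch_v1(epoch: int, timestamp: int, txs: list) -> bytes:
    def u(n: int) -> bytes:
        # Minimal big-endian integer encoding, as RLP expects.
        return n.to_bytes((n.bit_length() + 7) // 8, "big") if n else b""
    # BatchV1 = RLP([epoch, timestamp, transaction_list]),
    # prefixed with batch_version byte 0.
    return b"\x00" + rlp_encode([u(epoch), u(timestamp), txs])
```

The version byte sits outside the RLP payload, so a reader can dispatch on it before attempting any decoding.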

Was briefly confused about the name being part of the encoding. Making this a table with the first line as headers would make it clearer it's not the case.

Also do we really need names? (i.e. version 0 or "BatchV1" carries same info) Can add a comment column to the table later if need be.

- timestamp
- basefee
- *random* (the output of the [`RANDOM` opcode][random])
- L1 log entries emitted for [user deposits][g-deposits], augmented with a [sourceHash](./deposits.md#).

I guess confirmation depth stuff to be added later?

- `highest_valid_batch_timestamp = max(batch.timestamp for batch in filtered_batches)`,
or `0` if there are no `filtered_batches`.
`batch.timestamp` refers to the L2 block timestamp encoded in the batch.
- `next_l1_timestamp` is the timestamp of the next L1 block.
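The `highest_valid_batch_timestamp` rule above can be expressed directly (the dict-based batch representation here is an assumption for illustration):

```python
def highest_valid_batch_timestamp(filtered_batches) -> int:
    # batch["timestamp"] is the L2 block timestamp encoded in the batch;
    # the result defaults to 0 when no batches passed filtering.
    return max((batch["timestamp"] for batch in filtered_batches), default=0)
```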

I know this is a copy/paste, but would be really nice to have some rationale for highest_valid_batch_timestamp being used (just a note to self to add this).


### Payload Attributes Deriver

Payload Attributes are derived by combining transactions derived from the referenced L1 origin, with transactions from the batch:

Probably clarify L1 origin as the L1 block whose number matches the batch.

Contributor

@trianglesphere trianglesphere left a comment


two design comments @protolambda

Comment on lines +77 to +78
frame_number = uvarint # identifies the index of frame_data in the channel
frame_data_length = uvarint # frame_data_length == len(frame_data)
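For reference, the `uvarint` encoding used by these frame fields is unsigned LEB128: 7 payload bits per byte, with the high bit set on all but the last byte. A small sketch:

```python
def uvarint_encode(n: int) -> bytes:
    # Unsigned LEB128: emit 7 bits at a time, least-significant first;
    # the high bit marks "more bytes follow".
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        out.append(byte | (0x80 if n else 0))
        if not n:
            return bytes(out)

def uvarint_decode(data: bytes):
    # Returns (value, bytes_consumed); raises on a truncated input.
    value = shift = 0
    for i, b in enumerate(data):
        value |= (b & 0x7F) << shift
        if not b & 0x80:
            return value, i + 1
        shift += 7
    raise ValueError("truncated uvarint")
```

Note the reviewer's point below: fixed-width integers (uint16/uint32) would avoid the variable-length parsing entirely.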

I think we should change these to uint32s (could have frame # be uint16, need uint32 for data_length).


```text
CHANNEL_TIMEOUT = 600 # time in seconds until a channel is pruned from the channel bank
MAX_CHANNEL_BANK_SIZE = 100_000_000 # max size of the combined channels in the channel bank
```

I think a per channel max size + cap on the number of open channels is more resilient.

1. Read the version number of the L1 data; if it is not 0, the data is invalid.
2. Determine the `total_size` of the bank by summing the `size` of all channels in the bank.
For as long as `total_size > MAX_CHANNEL_BANK_SIZE`, prune the first-seen channel (i.e. lowest `time`) and reduce `total_size` by that channel's `size`.
The first-seen channel is pruned first, since this is the channel that failed to be read before the bank grew this large.
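The pruning loop in step 2 can be sketched as follows. The dict-of-dicts representation of the bank is an illustrative assumption, not the real data structure:

```python
MAX_CHANNEL_BANK_SIZE = 100_000_000  # from the constants above

def prune_channel_bank(channels: dict) -> dict:
    """channels maps a channel id to {"time": ..., "size": ...}.
    Prunes first-seen channels (lowest time) until the bank fits."""
    total_size = sum(ch["size"] for ch in channels.values())
    while total_size > MAX_CHANNEL_BANK_SIZE:
        first_seen = min(channels, key=lambda cid: channels[cid]["time"])
        total_size -= channels[first_seen]["size"]
        del channels[first_seen]
    return channels
```

Note the reviewer's objection below: pruning last-seen channels instead would let the node keep processing the currently open channel.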

I think we should prune the last seen to enable the node to keep processing the open channel. There's also an argument that we should reopen future channels after we've cleared the current channel.

2. Reset the Batch Queue, starting at the common L1 origin found during the Engine Queue reset.
3. Reset the stages down until the Channel Bank, wiping the contents and copying the safe origin of the Batch Queue.
4. Reset the Channel Bank
5. Reset the L1 source by wiping any buffered data, and copying the safe origin of the Channel Bank.
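A heavily hedged sketch of the staged reset walk described above: each stage wipes its buffered contents and adopts the safe origin found during the Engine Queue reset. The `Stage` class and `reset_pipeline` function are illustrative only, not the real pipeline API.

```python
class Stage:
    def __init__(self, name: str):
        self.name = name
        self.buffer = []   # buffered intermediate data
        self.origin = None  # L1 origin this stage reads from

    def reset(self, safe_origin):
        self.buffer.clear()        # wipe buffered contents
        self.origin = safe_origin  # adopt the safe origin to restart from

def reset_pipeline(stages, common_l1_origin):
    # stages ordered from the L1 source (index 0) to the Engine Queue (last);
    # walk back from the L2 side toward L1, resetting each stage.
    for stage in reversed(stages):
        stage.reset(common_l1_origin)
    return stages
```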

What are those "safe origins"? Are they different from one another and different from origin of safe L2 block?


After the resets are performed (possibly in smaller iterative chain traversal steps),
the regular pipeline derivation can dry-run to heal the buffer contents
(i.e. drop any outputs of a stage that lags behind the next stage closer to L2).

What is meant by the "regular pipeline"? Is there an "irregular pipeline" 😅?

And what does it mean to dry-run it? I have a feeling this is related to "reconciliation" of existing blocks in the execution engine with (re-)derived execution payload — is that it?

i.e. drop any outputs of a stage that lags behind the next stage closer to L2

I don't get this one.

The `unsafe` block starts as the last known tip of the L2 chain,
but is reset back to equal the `safe` block if at any point the `unsafe` block height is known but non-canonical.
An `unsafe` head with an origin beyond the `safe` origin is considered "plausible":
retaining the existing data is preferable, but it may be reorganized at a later point if consolidation with the `safe` chain fails.

reset back to equal the safe block if at any point the unsafe block height is known but non-canonical

what about:

reset back to equal the safe block if at any point the unsafe block becomes non-canonical.


Side note: we probably need to define the concept of "origin" in the glossary.

@mergify
Contributor

mergify bot commented Jun 29, 2022

Hey @protolambda! This PR has merge conflicts. Please fix them before continuing review.

@mergify mergify bot added the conflict label Jun 29, 2022
@protolambda protolambda force-pushed the batch-deriv-v2-spec branch from d6ab53f to b8ef653 on July 15, 2022 12:00
@mergify mergify bot removed the conflict label Jul 15, 2022
@protolambda protolambda force-pushed the batch-deriv-v2-spec branch from b8ef653 to b94c634 on July 15, 2022 12:32
theochap added a commit that referenced this pull request Dec 10, 2025
## Description

Fixes CI by:
- updating all dependencies (optimism-package, kurtosis, monorepo)
- fixing cargo hack failures caused by #2762 
- fixing udeps failure caused by #2762 
- fixing sysgo job failure upstream caused by #2762