
op-node: Prefer following seq drift instead of conf depth #3861

Merged

mslipper merged 2 commits into develop from jg/fix_los_selection on Nov 7, 2022

Conversation

@trianglesphere
Contributor

Description

We found a log on a testnet where a batch was rejected because its timestamp was greater than the origin timestamp + sequencer drift. I looked into origin selection and determined that we produce empty blocks if the block time is ahead of the origin + seq drift; however, if there is a next origin, producing an empty block is not enough: we must also change the origin.

I hypothesize that this occurred because the L1 head that the node was looking at was lagging. That would cause the node to keep using an old origin even if the timestamp was past the max drift. I added a unit test which shows this case & would fail without this fix being applied.
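The decision described above can be sketched as follows. This is a simplified illustration, not op-node's actual code: the types (`blockRef`, `config`) and the helper `selectOrigin` are hypothetical stand-ins, though the drift comparison mirrors the check discussed in this PR (l2Head.Time + BlockTime > currentOrigin.Time + MaxSequencerDrift).

```go
package main

import "fmt"

// Hypothetical, simplified stand-ins for op-node's L1BlockRef and rollup config.
type blockRef struct {
	Number uint64
	Time   uint64
}

type config struct {
	BlockTime         uint64 // L2 block time (2s on Optimism)
	MaxSequencerDrift uint64 // max L2 time ahead of its L1 origin
}

// selectOrigin sketches the fix: once the next L2 block's time would exceed
// currentOrigin.Time + MaxSequencerDrift, keep producing on the old origin
// only if the next origin is unknown or still too far in the future;
// otherwise advance to it.
func selectOrigin(cfg config, l2HeadTime uint64, currentOrigin blockRef, nextOrigin *blockRef) blockRef {
	nextL2Time := l2HeadTime + cfg.BlockTime
	pastSeqDrift := nextL2Time > currentOrigin.Time+cfg.MaxSequencerDrift
	if pastSeqDrift && nextOrigin != nil && nextL2Time >= nextOrigin.Time {
		return *nextOrigin // advance: the next L2 block may adopt this origin
	}
	return currentOrigin // repeat origin (empty blocks if past the drift)
}

func main() {
	cfg := config{BlockTime: 2, MaxSequencerDrift: 10}
	cur := blockRef{Number: 100, Time: 1}
	next := blockRef{Number: 101, Time: 11}
	// Past the drift (12 > 11) and next origin time 11 <= 12: advance.
	fmt.Println(selectOrigin(cfg, 10, cur, &next).Number) // prints 101
}
```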

Tests

Unit tested with a regression test.

Metadata

  • Fixes ENG-2923

@changeset-bot

changeset-bot bot commented Nov 2, 2022

⚠️ No Changeset found

Latest commit: 5f5a9d1

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types


@mergify
Contributor

mergify bot commented Nov 2, 2022

Hey @trianglesphere! This PR has merge conflicts. Please fix them before continuing review.

@mergify mergify bot added the conflict label Nov 2, 2022
@mergify mergify bot removed the conflict label Nov 2, 2022
Base automatically changed from jg/los_tests to develop November 3, 2022 01:53
@mergify
Contributor

mergify bot commented Nov 3, 2022

Hey @trianglesphere! This PR has merge conflicts. Please fix them before continuing review.

@mergify mergify bot added the conflict label Nov 3, 2022
@trianglesphere trianglesphere marked this pull request as ready for review November 3, 2022 16:24
@trianglesphere
Contributor Author

@Mergifyio refresh

@mergify mergify bot removed the conflict label Nov 3, 2022
@mergify
Contributor

mergify bot commented Nov 3, 2022

refresh

✅ Pull request refreshed

Contributor

@protolambda protolambda left a comment


What happens if the next L1 origin time is more than the sequencer time drift away from the current L1 origin? It should keep using the same L1 origin and force the blocks to be empty iirc.

But with this check, we can't produce an L2 block with the old origin anymore, and I think sequencing gets stuck until block derivation produces the block.

And when the next L1 origin time is within the time drift span from the current origin, we can stay on the old origin safely I believe?

@trianglesphere
Contributor Author

What happens if the next L1 origin time is more than the sequencer time drift away from the current L1 origin?

Then it should compare the next L2 time vs the child block of the head L2 block's L1 Origin. Sequencer drift is only about the L2 block time vs the L1 origin time of that block.
This code is a little tricky: at the point where it checks the sequencer drift, it knows that there is at least one more L1 block, but that L1 block may not yet be confirmed to the node's satisfaction. This means that the LOS will not attempt to advance to a not-yet-found block at the sequencer drift check.

Note: If we lose connection to L1 for too long (more than the sequencer drift), the sequencer may produce blocks when it should not have. (It must produce empty blocks for all L2 times from l1Origin.Time + seq drift to l1OriginNext.Time.) However, if the sequencer produced empty blocks past l1OriginNext.Time because it did not have l1OriginNext, those batches will be ignored.

@protolambda
Contributor

@trianglesphere

Say if we have:

L2 chain blocks:
A, num=200, time=10, origin=X

L1 chain blocks:
X, num=100, time=1
Y, num=101, time=50

L2 block time = 2
L2 seq time drift = 10

Then with this PR:

if l2Head.Time+los.cfg.BlockTime > currentOrigin.Time+los.cfg.MaxSequencerDrift { select next origin }

if 10 + 2 > 1 + 10 { select next origin }

But the next origin time is way ahead: 50 > 10 + 2, the next L2 block cannot have this origin yet, or it breaks the time invariant that L2 time is >= L1 origin time.
And so unless L1 reorgs out the block, the L2 sequencer gets stuck because the origin it selects cannot actually be used for a new L2 block?
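To make the arithmetic above concrete, here is a small sketch (hypothetical helper names, not op-node code) checking both conditions with the numbers from this example:

```go
package main

import "fmt"

// Numbers from the example above: L2 head A at time 10, L2 block time 2,
// seq time drift 10, current origin X at time 1, next origin Y at time 50.

// pastSeqDrift mirrors the PR's check on the current origin.
func pastSeqDrift(l2HeadTime, blockTime, originTime, maxDrift uint64) bool {
	return l2HeadTime+blockTime > originTime+maxDrift
}

// violatesTimeInvariant shows why naively selecting Y would break the
// invariant that L2 time is >= its L1 origin time.
func violatesTimeInvariant(nextL2Time, nextOriginTime uint64) bool {
	return nextL2Time < nextOriginTime
}

func main() {
	// 12 > 11: X can no longer be used for full blocks.
	fmt.Println(pastSeqDrift(10, 2, 1, 10)) // true
	// 12 < 50: Y cannot be adopted yet, so a naive "select next origin"
	// here would wedge the sequencer.
	fmt.Println(violatesTimeInvariant(12, 50)) // true
}
```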

If the current origin is past the sequencer drift, it needs to ignore the conf depth, but still needs to check the next L2 block time against the next origin time.

I've also added a test for this case.
@trianglesphere
Contributor Author

@protolambda I see what you mean. My mistake was that I was primarily considering an online sequencer, where even if the next origin is past the seq drift, we would not receive it until the L2 block time was reached. That was a bad assumption about timings.

I've fixed this so it ignores the conf depth if we are past the seq drift, and then relies on the time check. I've also added a test case based on the times you gave.

@norswap
Contributor

norswap commented Nov 4, 2022

Comments & clarifications. I'd need to do a deeper dive on the code to see what happens there, but I already have a lot to say based on the discussion alone:

I looked into origin selection and determined we produce empty blocks if the block time is ahead of the origin + seq drift; however if there is a next origin, it is not enough to produce an empty block, we must change the origin.

Confirming this, but reframing:

If we're ahead of origin + seq drift, the only allowed move is to create a new block with a new origin (since that moves the L2 time forward 2s but moves the L1 time forward at least 12s, hence reducing the drift).

Though not strictly necessary (in terms of properties), it's probably a good move to indeed make these blocks empty — the spec mandates it as we'll see.

What happens if the next L1 origin time is more than the sequencer time drift away from the current L1 origin? It should keep using the same L1 origin and force the blocks to be empty iirc.

Can this happen — that so many slots fail that two successive L1 blocks are very far away from one another?

But anyway, if I understand correctly, the issue here would be that the new origin's timestamp is actually now ahead of the L2 head, meaning we do need to output more empty blocks to catch up. This is definitely an edge case to consider.

In general, the algorithm should then be:

  • if nextOrigin.timestamp <= L2head.timestamp → create L2 block with next origin
  • otherwise, create empty blocks until nextOrigin.timestamp <= L2head.timestamp then create an L2 block with the next origin
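The two bullets above could be sketched like this (`step` is a hypothetical helper, not op-node code):

```go
package main

import "fmt"

// step decides, per the two bullets above, whether the next L2 block can
// adopt the next origin or must be an empty block on the current origin
// while L2 time catches up.
func step(l2HeadTime, nextOriginTime uint64) string {
	if nextOriginTime <= l2HeadTime {
		return "adopt next origin"
	}
	return "empty block on current origin"
}

func main() {
	fmt.Println(step(12, 11)) // next origin already reachable: adopt it
	fmt.Println(step(12, 50)) // next origin far ahead: produce empty blocks first
}
```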

And when the next L1 origin time is within the time drift span from the current origin, we can stay on the old origin safely I believe?

We can, but I think the right move is to adopt new origins as fast as possible to reduce the drift.


Btw, these conversations should be related back to the spec. There are two relevant snippets:

First one:

  • min_l2_timestamp <= block.timestamp < max_l2_timestamp, where
    • all these values are denominated in seconds
    • min_l2_timestamp = prev_l2_timestamp + l2_block_time
      • prev_l2_timestamp is the timestamp of the last L2 block of the previous epoch
      • l2_block_time is a configurable parameter of the time between L2 blocks (on Optimism, 2s)
    • max_l2_timestamp = max(l1_timestamp + max_sequencer_drift, min_l2_timestamp + l2_block_time)
      • l1_timestamp is the timestamp of the L1 block associated with the L2 block's epoch
      • max_sequencer_drift is the most a sequencer is allowed to get ahead of L1
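As a sanity check on the snippet above, here is a direct transcription of the bounds (`l2TimestampBounds` is a hypothetical helper name), evaluated with the times from the example earlier in this thread (head at time 10, block time 2, origin at time 1, drift 10):

```go
package main

import "fmt"

// l2TimestampBounds transcribes the spec bounds quoted above:
//   min_l2_timestamp = prev_l2_timestamp + l2_block_time
//   max_l2_timestamp = max(l1_timestamp + max_sequencer_drift,
//                          min_l2_timestamp + l2_block_time)
func l2TimestampBounds(prevL2Time, l2BlockTime, l1Time, maxSeqDrift uint64) (minTS, maxTS uint64) {
	minTS = prevL2Time + l2BlockTime
	maxTS = l1Time + maxSeqDrift
	if m := minTS + l2BlockTime; m > maxTS {
		maxTS = m
	}
	return minTS, maxTS
}

func main() {
	// min = 12, max = max(11, 14) = 14: the second branch of max lets the
	// chain go past l1_timestamp + drift when the L2 head forces it.
	minTS, maxTS := l2TimestampBounds(10, 2, 1, 10)
	fmt.Println(minTS, maxTS) // prints 12 14
}
```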

Second one:

  • batch.timestamp > batch_origin.time + max_sequencer_drift -> drop: i.e.
    a batch that does not adopt the next L1 within time will be dropped, in favor
    of an empty batch that can advance the L1 origin.

... // elided some more rules

If no batch can be accept-ed, and the stage has completed buffering of all batches that can fully be read from the L1 block at height epoch.number + sequence_window_size, and the next_epoch is available, then an empty batch can be derived with the following properties:

  • parent_hash = safe_l2_head.hash
  • timestamp = next_timestamp
  • transactions is empty, i.e. no sequencer transactions. Deposited transactions may be added in the next stage.
  • If next_timestamp < next_epoch.time: the current L1 origin is repeated, to preserve the L2 time invariant.
    • epoch_num = epoch.number
    • epoch_hash = epoch.hash
  • Otherwise,
    • epoch_num = next_epoch.number
    • epoch_hash = next_epoch.hash

where batch_origin == batch.origin.number, with the caveat that the origin may not be known, which I guess we should think about in the current case.
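The empty-batch epoch choice in the second spec snippet can be sketched directly (`epoch` and `emptyBatchEpoch` are hypothetical stand-ins, not op-node types):

```go
package main

import "fmt"

// Hypothetical stand-in for an L1 epoch reference.
type epoch struct {
	Number uint64
	Time   uint64
}

// emptyBatchEpoch applies the spec rule quoted above: if next_timestamp is
// still before next_epoch.time, repeat the current epoch to preserve the
// L2 time invariant; otherwise advance to next_epoch.
func emptyBatchEpoch(nextTimestamp uint64, cur, next epoch) epoch {
	if nextTimestamp < next.Time {
		return cur
	}
	return next
}

func main() {
	cur := epoch{Number: 100, Time: 1}
	next := epoch{Number: 101, Time: 50}
	fmt.Println(emptyBatchEpoch(12, cur, next).Number) // 100: must repeat the origin
	fmt.Println(emptyBatchEpoch(50, cur, next).Number) // 101: may advance
}
```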

The first snippet seems correct: it allows us to go past the drift if we have to — though that should only be allowed if we reduce the drift.

I guess that's something it doesn't say, and also "max_sequencer_drift is the most a sequencer is allowed to get ahead of L1" is de facto false, so there's room for improvement there!

The second snippet seems correct as well! If we're past the drift, generate an empty block, and since we have a drift, we need to adopt the next origin (next_epoch).

// the current origin block.
nextOrigin, err := los.l1.L1BlockRefByNumber(ctx, currentOrigin.Number+1)
if err != nil {
// TODO: this could result in a bad origin being selected if we are past the seq
Contributor


What is the deal with this todo?

Contributor Author


This would occur if pastSeqDrift is true & we fail to get the potential next L1 origin. In the case that the next L1 origin block time would have been valid (i.e. the next L2 time >= nextOrigin.Time) we would incorrectly return the currentOrigin instead of the next origin. This is a bit tricky because we don't know if the nextOrigin is valid until after we get it, so only part of the time the currentOrigin is invalid & part of the time it is valid.

Contributor

@protolambda protolambda left a comment


Change LGTM, and fixes part of the problem, but I think the discussion here + the new added todo highlights something that's still possible to go wrong on a testnet, and we should fix too. Let's discuss it on Monday.
