
op-node: Prefer following seq drift instead of conf depth #3861

Merged

mslipper merged 2 commits into develop from jg/fix_los_selection on Nov 7, 2022

Conversation

@trianglesphere
Contributor

Description

We found a log on a testnet where a batch was rejected because its timestamp was greater than the origin timestamp + sequencer drift. I looked into origin selection and determined that we produce empty blocks if the block time is ahead of the origin + seq drift; however, if there is a next origin, producing an empty block is not enough: we must also change the origin.

I hypothesize that this occurred because the L1 head that the node was looking at was lagging. That would cause the node to keep using an old origin even if the timestamp was past the max drift. I added a unit test which shows this case & would fail without this fix being applied.
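The decision described above can be sketched as follows. This is a simplified illustration, not op-node's actual code: the types (`blockRef`, `config`) and the helper `selectOrigin` are hypothetical stand-ins, though the drift comparison mirrors the check discussed in this PR (l2Head.Time + BlockTime > currentOrigin.Time + MaxSequencerDrift).

```go
package main

import "fmt"

// Hypothetical, simplified stand-ins for op-node's L1BlockRef and rollup config.
type blockRef struct {
	Number uint64
	Time   uint64
}

type config struct {
	BlockTime         uint64 // L2 block time (2s on Optimism)
	MaxSequencerDrift uint64 // max L2 time ahead of its L1 origin
}

// selectOrigin sketches the fix: once the next L2 block's time would exceed
// currentOrigin.Time + MaxSequencerDrift, keep producing on the old origin
// only if the next origin is unknown or still too far in the future;
// otherwise advance to it.
func selectOrigin(cfg config, l2HeadTime uint64, currentOrigin blockRef, nextOrigin *blockRef) blockRef {
	nextL2Time := l2HeadTime + cfg.BlockTime
	pastSeqDrift := nextL2Time > currentOrigin.Time+cfg.MaxSequencerDrift
	if pastSeqDrift && nextOrigin != nil && nextL2Time >= nextOrigin.Time {
		return *nextOrigin // advance: the next L2 block may adopt this origin
	}
	return currentOrigin // repeat origin (empty blocks if past the drift)
}

func main() {
	cfg := config{BlockTime: 2, MaxSequencerDrift: 10}
	cur := blockRef{Number: 100, Time: 1}
	next := blockRef{Number: 101, Time: 11}
	// Past the drift (12 > 11) and next origin time 11 <= 12: advance.
	fmt.Println(selectOrigin(cfg, 10, cur, &next).Number) // prints 101
}
```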

Tests

Unit tested with a regression test.

Metadata

  • Fixes ENG-2923

@changeset-bot

changeset-bot bot commented Nov 2, 2022

⚠️ No Changeset found

Latest commit: 5f5a9d1

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types


@mergify
Contributor

mergify bot commented Nov 2, 2022

Hey @trianglesphere! This PR has merge conflicts. Please fix them before continuing review.

@mergify mergify bot added the conflict label Nov 2, 2022
@mergify mergify bot removed the conflict label Nov 2, 2022
Base automatically changed from jg/los_tests to develop November 3, 2022 01:53
@mergify
Contributor

mergify bot commented Nov 3, 2022

Hey @trianglesphere! This PR has merge conflicts. Please fix them before continuing review.

@mergify mergify bot added the conflict label Nov 3, 2022
@trianglesphere trianglesphere marked this pull request as ready for review November 3, 2022 16:24
@trianglesphere
Contributor Author

@Mergifyio refresh

@mergify mergify bot removed the conflict label Nov 3, 2022
@mergify
Contributor

mergify bot commented Nov 3, 2022

refresh

✅ Pull request refreshed

Contributor

@protolambda protolambda left a comment


What happens if the next L1 origin time is more than the sequencer time drift away from the current L1 origin? It should keep using the same L1 origin and force the blocks to be empty iirc.

But with this check, we can't produce an L2 block with the old origin anymore, and I think sequencing gets stuck until block derivation produces the block.

And when the next L1 origin time is within the time drift span from the current origin, we can stay on the old origin safely I believe?

@trianglesphere
Contributor Author

What happens if the next L1 origin time is more than the sequencer time drift away from the current L1 origin?

Then it should compare the next L2 time vs the child block of the head L2 block's L1 Origin. Sequencer drift is only about the L2 block time vs the L1 origin time of that block.
This code is a little tricky: at the point where it checks the sequencer drift, it knows that there is at least one more L1 block, but that L1 block may not yet be confirmed to the node's satisfaction. This means that the LOS will not attempt to advance to a not-yet-found block at the sequencer drift check.

Note: If we lose connection to L1 for too long (more than the sequencer drift), the sequencer may produce blocks when it should not have. (It must produce empty blocks for all L2 times from l1Origin.Time + seq drift to l1OriginNext.Time.) However, if the sequencer produced empty blocks past l1OriginNext.Time because it did not have l1OriginNext, those batches will be ignored.

@protolambda
Contributor

@trianglesphere

Say if we have:

L2 chain blocks:
A, num=200, time=10, origin=X

L1 chain blocks:
X, num=100, time=1
Y, num=101, time=50

L2 block time = 2
L2 seq time drift = 10

Then with this PR:

if l2Head.Time+los.cfg.BlockTime > currentOrigin.Time+los.cfg.MaxSequencerDrift { select next origin }

if 10 + 2 > 1 + 10 { select next origin }

But the next origin time is way ahead: 50 > 10 + 2, the next L2 block cannot have this origin yet, or it breaks the time invariant that L2 time is >= L1 origin time.
And so unless L1 reorgs out the block, the L2 sequencer gets stuck because the origin it selects cannot actually be used for a new L2 block?
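To make the arithmetic above concrete, here is a small sketch (hypothetical helper names, not op-node code) checking both conditions with the numbers from this example:

```go
package main

import "fmt"

// Numbers from the example above: L2 head A at time 10, L2 block time 2,
// seq time drift 10, current origin X at time 1, next origin Y at time 50.

// pastSeqDrift mirrors the PR's check on the current origin.
func pastSeqDrift(l2HeadTime, blockTime, originTime, maxDrift uint64) bool {
	return l2HeadTime+blockTime > originTime+maxDrift
}

// violatesTimeInvariant shows why naively selecting Y would break the
// invariant that L2 time is >= its L1 origin time.
func violatesTimeInvariant(nextL2Time, nextOriginTime uint64) bool {
	return nextL2Time < nextOriginTime
}

func main() {
	// 12 > 11: X can no longer be used for full blocks.
	fmt.Println(pastSeqDrift(10, 2, 1, 10)) // true
	// 12 < 50: Y cannot be adopted yet, so a naive "select next origin"
	// here would wedge the sequencer.
	fmt.Println(violatesTimeInvariant(12, 50)) // true
}
```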

If the current origin is past the sequencer drift, it needs to ignore the conf depth, but still needs to check the next L2 block time against the next origin time.

I've also added a test for this case.
@trianglesphere
Contributor Author

@protolambda I see what you mean. My mistake was that I was primarily considering an online sequencer, where even if the next origin is past the seq drift, we would not receive it until the L2 block time was reached. That was a bad assumption about timings.

I've fixed this so it ignores the conf depth if we are past the seq drift, and then relies on the time check. I've also added a test case based on the times you gave.

@norswap
Contributor

norswap commented Nov 4, 2022

Comments & clarifications. I'd need to do a deeper dive on the code to see what happens there, but I already have a lot to say based on the discussion alone:

I looked into origin selection and determined we produce empty blocks if the block time is ahead of the origin + seq drift; however if there is a next origin, it is not enough to produce an empty block, we must change the origin.

Confirming this, but reframing:

If we're ahead of origin + seq drift, the only allowed move is to create a new block with a new origin (since that moves the L2 time forward 2s but moves the L1 time forward at least 12s, hence reducing the drift).

Though not strictly necessary (in terms of properties), it's probably a good move to indeed make these blocks empty — the spec mandates it as we'll see.

What happens if the next L1 origin time is more than the sequencer time drift away from the current L1 origin? It should keep using the same L1 origin and force the blocks to be empty iirc.

Can this happen — that so many slots fail that two successive L1 blocks are very far away from one another?

But anyway, if I understand correctly, the issue here would be that the new origin's timestamp is actually now ahead of the L2 head, meaning we do need to output more empty blocks to catch up. This is definitely an edge case to consider.

In general, the algorithm should then be:

  • if nextOrigin.timestamp <= L2head.timestamp → create L2 block with next origin
  • otherwise, create empty blocks until nextOrigin.timestamp <= L2head.timestamp then create an L2 block with the next origin
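The two bullets above could be sketched like this (`step` is a hypothetical helper, not op-node code):

```go
package main

import "fmt"

// step decides, per the two bullets above, whether the next L2 block can
// adopt the next origin or must be an empty block on the current origin
// while L2 time catches up.
func step(l2HeadTime, nextOriginTime uint64) string {
	if nextOriginTime <= l2HeadTime {
		return "adopt next origin"
	}
	return "empty block on current origin"
}

func main() {
	fmt.Println(step(12, 11)) // next origin already reachable: adopt it
	fmt.Println(step(12, 50)) // next origin far ahead: produce empty blocks first
}
```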

And when the next L1 origin time is within the time drift span from the current origin, we can stay on the old origin safely I believe?

We can, but I think the right move is to adopt new origins as fast as possible to reduce the drift.


Btw, these conversations should be related back to the spec. There are two relevant snippets:

First one:

  • min_l2_timestamp <= block.timestamp < max_l2_timestamp, where
    • all these values are denominated in seconds
    • min_l2_timestamp = prev_l2_timestamp + l2_block_time
      • prev_l2_timestamp is the timestamp of the last L2 block of the previous epoch
      • l2_block_time is a configurable parameter of the time between L2 blocks (on Optimism, 2s)
    • max_l2_timestamp = max(l1_timestamp + max_sequencer_drift, min_l2_timestamp + l2_block_time)
      • l1_timestamp is the timestamp of the L1 block associated with the L2 block's epoch
      • max_sequencer_drift is the most a sequencer is allowed to get ahead of L1
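As a sanity check on the snippet above, here is a direct transcription of the bounds (`l2TimestampBounds` is a hypothetical helper name), evaluated with the times from the example earlier in this thread (head at time 10, block time 2, origin at time 1, drift 10):

```go
package main

import "fmt"

// l2TimestampBounds transcribes the spec bounds quoted above:
//   min_l2_timestamp = prev_l2_timestamp + l2_block_time
//   max_l2_timestamp = max(l1_timestamp + max_sequencer_drift,
//                          min_l2_timestamp + l2_block_time)
func l2TimestampBounds(prevL2Time, l2BlockTime, l1Time, maxSeqDrift uint64) (minTS, maxTS uint64) {
	minTS = prevL2Time + l2BlockTime
	maxTS = l1Time + maxSeqDrift
	if m := minTS + l2BlockTime; m > maxTS {
		maxTS = m
	}
	return minTS, maxTS
}

func main() {
	// min = 12, max = max(11, 14) = 14: the second branch of max lets the
	// chain go past l1_timestamp + drift when the L2 head forces it.
	minTS, maxTS := l2TimestampBounds(10, 2, 1, 10)
	fmt.Println(minTS, maxTS) // prints 12 14
}
```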

Second one:

  • batch.timestamp > batch_origin.time + max_sequencer_drift -> drop: i.e.
    a batch that does not adopt the next L1 within time will be dropped, in favor
    of an empty batch that can advance the L1 origin.

... // elided some more rules

If no batch can be accept-ed, and the stage has completed buffering of all batches that can fully be read from the L1 block at height epoch.number + sequence_window_size, and the next_epoch is available, then an empty batch can be derived with the following properties:

  • parent_hash = safe_l2_head.hash
  • timestamp = next_timestamp
  • transactions is empty, i.e. no sequencer transactions. Deposited transactions may be added in the next stage.
  • If next_timestamp < next_epoch.time: the current L1 origin is repeated, to preserve the L2 time invariant.
    • epoch_num = epoch.number
    • epoch_hash = epoch.hash
  • Otherwise,
    • epoch_num = next_epoch.number
    • epoch_hash = next_epoch.hash

where batch_origin == batch.origin.number, with the caveat that the origin may not be known, which I guess we should think about in the current case.
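The empty-batch epoch choice in the second spec snippet can be sketched directly (`epoch` and `emptyBatchEpoch` are hypothetical stand-ins, not op-node types):

```go
package main

import "fmt"

// Hypothetical stand-in for an L1 epoch reference.
type epoch struct {
	Number uint64
	Time   uint64
}

// emptyBatchEpoch applies the spec rule quoted above: if next_timestamp is
// still before next_epoch.time, repeat the current epoch to preserve the
// L2 time invariant; otherwise advance to next_epoch.
func emptyBatchEpoch(nextTimestamp uint64, cur, next epoch) epoch {
	if nextTimestamp < next.Time {
		return cur
	}
	return next
}

func main() {
	cur := epoch{Number: 100, Time: 1}
	next := epoch{Number: 101, Time: 50}
	fmt.Println(emptyBatchEpoch(12, cur, next).Number) // 100: must repeat the origin
	fmt.Println(emptyBatchEpoch(50, cur, next).Number) // 101: may advance
}
```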

The first snippet seems correct: it allows us to go past the drift if we have to — though that should only be allowed if we reduce the drift.

I guess that's something it doesn't say, and also "max_sequencer_drift is the most a sequencer is allowed to get ahead of L1" is de facto false, so there's room for improvement there!

The second snippet seems correct as well! If we're past the drift, generate an empty block, and since we have a drift, we need to adopt the next origin (next_epoch).

// the current origin block.
nextOrigin, err := los.l1.L1BlockRefByNumber(ctx, currentOrigin.Number+1)
if err != nil {
// TODO: this could result in a bad origin being selected if we are past the seq
Contributor


What is the deal with this todo?

Contributor Author


This would occur if pastSeqDrift is true & we fail to get the potential next L1 origin. In the case that the next L1 origin block time would have been valid (i.e. the next L2 time >= nextOrigin.Time) we would incorrectly return the currentOrigin instead of the next origin. This is a bit tricky because we don't know if the nextOrigin is valid until after we get it, so only part of the time the currentOrigin is invalid & part of the time it is valid.

Contributor

@protolambda protolambda left a comment


Change LGTM, and fixes part of the problem, but I think the discussion here + the new added todo highlights something that's still possible to go wrong on a testnet, and we should fix too. Let's discuss it on Monday.
