Skip to content

Use poh grace ticks when new reset bank is pending#794

Merged
mergify[bot] merged 3 commits intoanza-xyz:masterfrom
jstarry:feat/smarter-grace-ticks
Apr 18, 2024
Merged

Use poh grace ticks when new reset bank is pending#794
mergify[bot] merged 3 commits intoanza-xyz:masterfrom
jstarry:feat/smarter-grace-ticks

Conversation

@jstarry
Copy link
Copy Markdown

@jstarry jstarry commented Apr 13, 2024

Problem

I've observed that a lot of skipped slots are caused when a leader "A" tries to skip the previous leader "B" when producing its blocks but the previous leader had actually already started broadcasting shreds for its blocks. In this case, there's a race between the forks of leader A and B and it's almost certain that leader B's fork will be confirmed because its block will likely be finished and replayed by the cluster before leader A's block that only just started being produced.

Summary of Changes

Currently when leader A skips all of leader B's blocks, it doesn't use grace ticks when deciding when to start building its block. After this PR, when a leader skips all of the previous leader's blocks and has ticked to its leader slot (without grace ticks) it will first check if it has received any shreds for a potential new reset bank. If it has received some shreds, it will apply grace ticks to wait a bit longer to allow time for the corresponding bank for those shreds to be frozen (or marked dead). If it hasn't received any shreds, it can go ahead with producing its block without waiting for grace ticks as before.

  • Added new hidden validator cli arg --delay-leader-block-for-pending-fork which allows validators to opt into this new behavior.
  • Added new metric poh_recorder-detected_pending_fork which reports when a pending fork was detected and whether or not the validator yielded

Fixes #

@jstarry jstarry added the v1.18 label Apr 13, 2024
@mergify
Copy link
Copy Markdown

mergify Bot commented Apr 13, 2024

Backports to the beta branch are to be avoided unless absolutely necessary for fixing bugs, security issues, and perf regressions. Changes intended for backport should be structured such that a minimum effective diff can be committed separately from any refactoring, plumbing, cleanup, etc that are not strictly necessary to achieve the goal. Any of the latter should go only into master and ride the normal stabilization schedule. Exceptions include CI/metrics changes, CLI improvements and documentation updates on a case by case basis.

@jstarry
Copy link
Copy Markdown
Author

jstarry commented Apr 13, 2024

I would like to backport to v1.18 because I believe it will have a notable impact on skip rate and I would like to see it get rolled out on testnet before mainnet

@jstarry jstarry force-pushed the feat/smarter-grace-ticks branch from 275fa30 to 43df692 Compare April 13, 2024 14:03
@jstarry jstarry requested review from AshwinSekar and carllin April 13, 2024 14:37
@codecov-commenter
Copy link
Copy Markdown

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 81.9%. Comparing base (09241ae) to head (43df692).
Report is 3 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master     #794   +/-   ##
=======================================
  Coverage    81.9%    81.9%           
=======================================
  Files         851      851           
  Lines      231504   231542   +38     
=======================================
+ Hits       189731   189796   +65     
+ Misses      41773    41746   -27     

@carllin
Copy link
Copy Markdown

carllin commented Apr 15, 2024

This is an interesting PR, implements a similar idea to yielding to slow leaders that @AshwinSekar and I had been discussing in previous weeks.

Wondering how we can experimentally verify this causes fewer forks. Can we

  1. Add a metric to track to when we're detecting a slower leader here:
    && !self.is_new_reset_bank_pending(next_slot)
  2. Add flag/ability to either yield or not yield

We then compare the metric from 1 between validators with 2 turned on and off, how many got their slots included vs not included into the main fork.

@jstarry
Copy link
Copy Markdown
Author

jstarry commented Apr 16, 2024

@carllin I added a metric which reports the leader slot if we detect a pending fork and whether or not we decided to yield for some grace ticks. We can then check whether those reported leader slots were confirmed or not.

@jstarry jstarry force-pushed the feat/smarter-grace-ticks branch from d49e98c to 971380e Compare April 16, 2024 08:05
carllin
carllin previously approved these changes Apr 17, 2024
@AshwinSekar
Copy link
Copy Markdown

If we're experimenting can we make the flag hidden, at least for the backport to 1.18? It seems a couple staked validators against mainnet would suffice to test this out?

@jstarry
Copy link
Copy Markdown
Author

jstarry commented Apr 18, 2024

@AshwinSekar made it hidden, can I get another approval?

@jstarry jstarry added the automerge automerge Merge this Pull Request automatically once CI passes label Apr 18, 2024
@mergify mergify Bot merged commit 1c1b4c3 into anza-xyz:master Apr 18, 2024
mergify Bot pushed a commit that referenced this pull request Apr 18, 2024
* Use poh grace ticks when new reset bank is pending

* feedback

* make it hidden

(cherry picked from commit 1c1b4c3)

# Conflicts:
#	core/src/validator.rs
#	local-cluster/src/validator_configs.rs
#	validator/src/main.rs
@jstarry jstarry deleted the feat/smarter-grace-ticks branch April 19, 2024 02:15
michaelschem pushed a commit to michaelschem/agave that referenced this pull request Apr 20, 2024
* Use poh grace ticks when new reset bank is pending

* feedback

* make it hidden
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

automerge automerge Merge this Pull Request automatically once CI passes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants