Speed up altair block processing 2x #3115

Merged 2 commits from speedups-20211122 into unstable on Nov 24, 2021

Conversation

arnetheduck (Member):

Like #3089, this PR drastically speeds up historical REST queries and
other long state replays.

* cache sync committee validator indices
* use ~80mb less memory for validator pubkey mappings
* batch-verify sync aggregate signature (fixes #2985)
* document sync committee hack with head block vs sync message block
* add batch signature verification failure tests
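
As a rough illustration of the first bullet, here is a minimal sketch of a per-period index cache (names and types are illustrative stand-ins, not the PR's actual code): the committee's 512 pubkey-to-index lookups are done once per sync committee period and reused for every block inside it.

```
import std/tables

const SYNC_COMMITTEE_SIZE = 512

type
  PubKey = array[48, byte]   # stand-in for the real compressed key type
  SyncCommitteeIndexCache = object
    period: uint64           # sync committee period the cache is valid for
    filled: bool
    indices: array[SYNC_COMMITTEE_SIZE, int]

proc syncIndices(cache: var SyncCommitteeIndexCache,
                 period: uint64,
                 committee: openArray[PubKey],
                 keyToIndex: Table[PubKey, int]): seq[int] =
  ## Resolve committee pubkeys to validator indices, recomputing only
  ## when the sync committee period changes (roughly once a day on
  ## mainnet); all other calls just reuse the cached array.
  if not cache.filled or cache.period != period:
    for i, key in committee:
      cache.indices[i] = keyToIndex[key]
    cache.period = period
    cache.filled = true
  @(cache.indices)
```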

Before:

```
../env.sh nim c -d:release -r ncli_db --db:mainnet_0/db bench --start-slot:-1000
All time are ms
     Average,       StdDev,          Min,          Max,      Samples,         Test
Validation is turned off meaning that no BLS operations are performed
    5830.675,        0.000,     5830.675,     5830.675,            1, Initialize DB
       0.481,        1.878,        0.215,       59.167,          981, Load block from database
    8422.566,        0.000,     8422.566,     8422.566,            1, Load state from database
       6.996,        1.678,        0.042,       14.385,          969, Advance slot, non-epoch
      93.217,        8.318,       84.192,      122.209,           32, Advance slot, epoch
      20.513,       23.665,       11.510,      201.561,          981, Apply block, no slot processing
       0.000,        0.000,        0.000,        0.000,            0, Database load
       0.000,        0.000,        0.000,        0.000,            0, Database store
```

After:

```
    7081.422,        0.000,     7081.422,     7081.422,            1, Initialize DB
       0.553,        2.122,        0.175,       66.692,          981, Load block from database
    5439.446,        0.000,     5439.446,     5439.446,            1, Load state from database
       6.829,        1.575,        0.043,       12.156,          969, Advance slot, non-epoch
      94.716,        2.749,       88.395,      100.026,           32, Advance slot, epoch
      11.636,       23.766,        4.889,      205.250,          981, Apply block, no slot processing
       0.000,        0.000,        0.000,        0.000,            0, Database load
       0.000,        0.000,        0.000,        0.000,            0, Database store
```
arnetheduck changed the title from "Speed up altair block processing >2x" to "Speed up altair block processing 2x" on Nov 22, 2021
github-actions bot commented Nov 22, 2021

Unit Test Results

12 files ±0, 748 suites +4, 34m 33s duration (+3m 8s)
1 467 tests +1: 1 465 passed (+1), 2 skipped (±0), 0 failed (±0)
8 936 runs +4: 8 928 passed (+4), 8 skipped (±0), 0 failed (±0)

Results for commit 8e445dc. Comparison against base commit 95dd846.

```
@@ -744,9 +747,9 @@ proc get_next_sync_committee*(state: altair.BeaconState | merge.BeaconState):
   # see signatures_batch, TODO shouldn't be here
   # Deposit processing ensures all keys are valid
   var attestersAgg: AggregatePublicKey
   attestersAgg.init(res.pubkeys.data[0].loadWithCache().get)
```

Contributor:

What's wrong with using the cache here?

arnetheduck (Member Author) commented Nov 23, 2021:

The cache would only be used once per day for the 512 lookups and uses quite a lot of memory (absolute minimum 144 bytes per validator + hashset and threadvar overhead which is significant) - thanks to the index cache (about 4kb fixed size), the cost no longer matches the benefit. There are no other uses of this cache outside of test cases (I'd generally remove it).
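
(Back-of-envelope, assuming the roughly 250 000 active mainnet validators of the time: 144 B × 250 000 ≈ 36 MB as a floor, before hashset and threadvar overhead, which is consistent with the ~80 MB saving cited in the PR description; the index cache stays at a fixed ~4 KB no matter how large the validator set grows.)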

There's an additional optimization that can be done to pre-seed the statecache with the index cache as long as the states line up, but that's for a separate PR - the main benefit is during replays where many blocks need to be applied - the creation of the index mapping is the elephant in the room.

Contributor:

Previously, the keys would end up in the cache anyway because the cache was also used in gossip validation. I guess it's now possible to introduce a more precise eviction of the cache around the update points of the SyncCommitteeCache (or the cooked keys could be stored in the SyncCommitteeCache, but this seems potentially sub-optimal if we end up needing the same validator keys in other caches).

arnetheduck (Member Author):

We have two caches of the same data where one would be enough: loadWithCache and validatorKey. The latter is tied to our database storage and should be preferred, also because constructing it is faster: loadWithCache works from compressed keys. If anything, we could keep cooked keys in the in-memory representation that backs validatorKey.
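
A minimal sketch of that last idea, with the cooked key stored next to the data that backs validatorKey (type and field names are assumptions, not the actual nimbus-eth2 layout):

```
type
  CookedPubKey = object          # stand-in for a deserialized BLS pubkey
    point: array[96, byte]

  ImmutableValidatorEntry = object
    compressed: array[48, byte]  # what is written to / read from the database
    cooked: CookedPubKey         # deserialized once, when loaded from disk

proc validatorKey(entries: seq[ImmutableValidatorEntry],
                  index: uint64): CookedPubKey =
  ## O(1) lookup with no per-call deserialization and no separate
  ## loadWithCache-style table to keep consistent or evict.
  entries[index.int].cooked
```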

beacon_chain/validators/validator_duties.nim (resolved discussion)
```
@@ -414,4 +414,46 @@ proc collectSignatureSets*(
        volex.message.epoch,
        DOMAIN_VOLUNTARY_EXIT)

  block:
```

Contributor:

For documentation, it would be nice to have a comment section: "7. SyncAggregate".

arnetheduck (Member Author):

8e445dc - there's not much to write though; like the proposer signature, it's just a single field with a signature in it.
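
For readers following along, a toy, self-contained sketch of what the batched sync aggregate step amounts to (all types below are stand-ins, not nimbus-eth2's; the real code is the collectSignatureSets addition in the diff above): fold the participant keys flagged in sync_committee_bits into one aggregate pubkey, then queue a single signature set to be verified in the same batch as the proposer and attestation signatures.

```
const SYNC_COMMITTEE_SIZE = 512

type
  PubKey = int                   # toy stand-in for a cooked BLS public key
  Signature = int                # toy stand-in for a BLS signature
  SignatureSet = tuple[pubkey: PubKey, message: string, signature: Signature]
  SyncAggregate = object
    syncCommitteeBits: array[SYNC_COMMITTEE_SIZE, bool]
    syncCommitteeSignature: Signature

proc aggregate(a: var PubKey, b: PubKey) =
  a += b                         # stands in for BLS public key aggregation

proc collectSyncAggregateSet(sets: var seq[SignatureSet],
                             agg: SyncAggregate,
                             committeeKeys: array[SYNC_COMMITTEE_SIZE, PubKey],
                             signedBlockRoot: string) =
  var aggPk: PubKey
  var participants = 0
  for i, participated in agg.syncCommitteeBits:
    if participated:
      aggPk.aggregate(committeeKeys[i])
      inc participants
  if participants > 0:
    # One entry in the batch instead of a standalone BLS verification.
    sets.add((aggPk, signedBlockRoot, agg.syncCommitteeSignature))
```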

```
-      var resCur: Table[ValidatorPubKey, int]
-      var resNxt: Table[ValidatorPubKey, int]
+      var resCur: Table[uint64, int]
+      var resNxt: Table[uint64, int]
```

Contributor:

Some alias type that tells us what this magic uint64 corresponds to?

arnetheduck (Member Author):

They're uint64 values from the message that are repeated in the response, and in the message they don't have any alias type. They're kind of validator indices, except we don't want to use ValidatorIndex because that's messy from a type size perspective.
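
If an alias were wanted anyway, a sketch of what the reviewer's suggestion might look like (hypothetical; the PR keeps plain uint64):

```
import std/[hashes, tables]

type
  SyncCommitteeValidatorIdx = distinct uint64  # hypothetical alias, not in the PR

proc `==`(a, b: SyncCommitteeValidatorIdx): bool {.borrow.}
proc hash(x: SyncCommitteeValidatorIdx): Hash {.borrow.}

var resCur: Table[SyncCommitteeValidatorIdx, int]
var resNxt: Table[SyncCommitteeValidatorIdx, int]
```

A distinct type documents intent without inheriting ValidatorIndex's range constraints, at the cost of the borrow boilerplate.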

```
         syncCommitteeSlot,
         committeeIdx,
         msg.message.contribution.aggregation_bits):
-      let validatorPubKey = validatorPubKey.loadWithCache.get
+      let validatorPubKey = dag.validatorKey(validatorIndex)
```

Contributor:

The previous use of the cache here was quite important for performance. @mratsim, is dag.validatorKey(index) considered fast enough? (it's based internally on loadValid which in turn is based on fromBytesKnownOnCurve).

Contributor:

block_sim is a good benchmark for this, because it simulates the full network traffic around sync committees and their aggregation. The ncli_db bench just replays blocks, so it covers only the sync aggregate verification code.

arnetheduck (Member Author):

ImmutableValidatorData2 stores uncompressed pubkeys that are quite fast to load - this was introduced a while ago, giving a decent performance boost: d859bc1

We're using this same approach for attestations and pretty much everything else - if we're going to update it, we should update it across the board by changing the backend storage for validatorKey. Sync message validation was the outlier here, and loadWithCache is in general a problematic construct.
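
(Background for readers outside the thread: a compressed 48-byte BLS12-381 pubkey stores only the x coordinate, so loading it requires a field square root to recover y; the uncompressed 96-byte form stored by ImmutableValidatorData2 carries both coordinates, so a trusting loader along the lines of fromBytesKnownOnCurve can skip the expensive decompression step, which is what keeps validatorKey lookups cheap.)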

Contributor:

Posting before/after block_sim results would be convincing evidence that there is no performance regression.

Contributor:

I've added a benchmark for pubkey and signature deserialization. On an 11th gen Intel it's 14 µs per key.

[benchmark results screenshot]

status-im/nim-blscurve#126
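
(Back-of-envelope: at 14 µs per key, resolving a full 512-member sync committee from compressed keys would cost roughly 512 × 14 µs ≈ 7 ms per call, the same order of magnitude as the per-block "Apply block" improvement in the benchmarks above, so doing the lookups once per period rather than per block is where the win comes from.)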

arnetheduck (Member Author):

That said, it's likely a valid optimization point regardless: loading the data into cooked pubkeys on startup.

arnetheduck (Member Author):

I'm not sure 13440 is relevant here - different data set, multiple flaws in methodology

zah (Contributor) commented Nov 23, 2021:

Yes, if loading uncompressed keys is so fast, loading them on start-up shouldn't be a problem either.

arnetheduck (Member Author):

Separate PR though, since it affects attestations significantly.

arnetheduck merged commit 9c2f43e into unstable on Nov 24, 2021
arnetheduck deleted the speedups-20211122 branch on Nov 24, 2021 at 12:43
arnetheduck (Member Author):

@mratsim what are you measuring there btw? compressed or uncompressed deserialization?
