Speed up altair block processing 2x #3115
Conversation
Like #3089, this PR drastically speeds up historical REST queries and other long state replays.

* cache sync committee validator indices
* use ~80mb less memory for validator pubkey mappings
* batch-verify sync aggregate signature (fixes #2985)
* document sync committee hack with head block vs sync message block
* add batch signature verification failure tests

Before:

```
../env.sh nim c -d:release -r ncli_db --db:mainnet_0/db bench --start-slot:-1000
All time are ms
Average, StdDev, Min, Max, Samples, Test
Validation is turned off meaning that no BLS operations are performed
5830.675, 0.000, 5830.675, 5830.675, 1, Initialize DB
0.481, 1.878, 0.215, 59.167, 981, Load block from database
8422.566, 0.000, 8422.566, 8422.566, 1, Load state from database
6.996, 1.678, 0.042, 14.385, 969, Advance slot, non-epoch
93.217, 8.318, 84.192, 122.209, 32, Advance slot, epoch
20.513, 23.665, 11.510, 201.561, 981, Apply block, no slot processing
0.000, 0.000, 0.000, 0.000, 0, Database load
0.000, 0.000, 0.000, 0.000, 0, Database store
```

After:

```
7081.422, 0.000, 7081.422, 7081.422, 1, Initialize DB
0.553, 2.122, 0.175, 66.692, 981, Load block from database
5439.446, 0.000, 5439.446, 5439.446, 1, Load state from database
6.829, 1.575, 0.043, 12.156, 969, Advance slot, non-epoch
94.716, 2.749, 88.395, 100.026, 32, Advance slot, epoch
11.636, 23.766, 4.889, 205.250, 981, Apply block, no slot processing
0.000, 0.000, 0.000, 0.000, 0, Database load
0.000, 0.000, 0.000, 0.000, 0, Database store
```
```
@@ -744,9 +747,9 @@ proc get_next_sync_committee*(state: altair.BeaconState | merge.BeaconState):
    # see signatures_batch, TODO shouldn't be here
    # Deposit processing ensures all keys are valid
    var attestersAgg: AggregatePublicKey
    attestersAgg.init(res.pubkeys.data[0].loadWithCache().get)
```
What's wrong with using the cache here?
The cache would only be used once per day for the 512 lookups, and it uses quite a lot of memory (an absolute minimum of 144 bytes per validator, plus hashset and threadvar overhead, which is significant) - thanks to the index cache (about 4kb, fixed size), the cost no longer matches the benefit. There are no other uses of this cache outside of test cases (I'd generally remove it).
There's an additional optimization that can be done: pre-seed the statecache with the index cache, as long as the states line up - but that's for a separate PR. The main benefit is during replays where many blocks need to be applied; there, the creation of the index mapping is the elephant in the room.
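As a rough sanity check of that trade-off (hedged - the active validator count isn't stated in this thread; assume on the order of 250k mainnet validators at the time): 250,000 × 144 bytes ≈ 36 MB of raw cache entries before hashset and threadvar overhead, in the same ballpark as the ~80mb saving quoted in the PR description, while the index cache stays at roughly 512 × 8 bytes ≈ 4 kB regardless of validator count.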
Previously, the keys would end up in the cache anyway because the cache was also used in gossip validation. I guess it's now possible to introduce a more precise eviction of the cache around the update points of the SyncCommitteeCache (or the cooked keys could be stored in the SyncCommitteeCache, but this seems potentially sub-optimal if we end up needing the same validators' keys in other caches).
we have two caches of the same data, where one would be enough: loadWithCache and validatorKey - the latter is tied to our database storage and should be preferred, also because constructing it is faster: loadWithCache works from compressed keys. If anything, we could keep cooked keys in the in-memory representation that backs validatorKey.
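A minimal sketch of that last idea, with placeholder types rather than the real nimbus-eth2 definitions (ValidatorKeyStore, CompressedPubKey and CookedPubKey below are made up for illustration): keep the deserialized ("cooked") key next to the compressed one, so an index-based lookup never re-runs BLS deserialization and no separate hash-table cache is needed.

```
# Hedged sketch with placeholder types, not the real nimbus-eth2 API:
# store the pre-deserialized ("cooked") key next to the compressed one,
# so per-lookup BLS deserialization becomes unnecessary.
import std/options

type
  CompressedPubKey = array[48, byte]   # placeholder for ValidatorPubKey
  CookedPubKey = array[96, byte]       # placeholder for a deserialized key

  ImmutableValidatorEntry = object
    compressed: CompressedPubKey
    cooked: CookedPubKey               # deserialized once, when loaded from the DB

  ValidatorKeyStore = object
    entries: seq[ImmutableValidatorEntry]

proc validatorKey(store: ValidatorKeyStore, index: uint64): Option[CookedPubKey] =
  ## O(1) lookup by validator index; no per-call deserialization needed.
  if index.int < store.entries.len:
    some(store.entries[index.int].cooked)
  else:
    none(CookedPubKey)
```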
```
@@ -414,4 +414,46 @@ proc collectSignatureSets*(
      volex.message.epoch,
      DOMAIN_VOLUNTARY_EXIT)

  block:
```
For documentation it would be nice to have a comment section here: 7. SyncAggregate
8e445dc - there's not much to write though; like the proposer signature, it's just a single field with a signature in it.
```
-  var resCur: Table[ValidatorPubKey, int]
-  var resNxt: Table[ValidatorPubKey, int]
+  var resCur: Table[uint64, int]
+  var resNxt: Table[uint64, int]
```
Some alias type that tells us what this magic uint64 corresponds to?
they're uint64 values from the message that are repeated in the response, and in the message, they don't have any alias type - they're kind of validator indices, except we don't want to use ValidatorIndex because that's messy from a type size perspective.
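Purely for illustration (this is not what the PR does, and SyncValidatorIndex is a hypothetical name), a distinct uint64 would document the intent while keeping the plain-integer representation:

```
# Hedged sketch, not the PR's code: SyncValidatorIndex is a made-up alias.
import std/[tables, hashes]

type
  SyncValidatorIndex = distinct uint64  # same size as uint64, but named

# Table keys need `==` and `hash`; forward both to the underlying uint64.
proc `==`(a, b: SyncValidatorIndex): bool = uint64(a) == uint64(b)
proc hash(x: SyncValidatorIndex): Hash = hash(uint64(x))

var
  resCur: Table[SyncValidatorIndex, int]
  resNxt: Table[SyncValidatorIndex, int]

resCur[SyncValidatorIndex(3)] = 0  # usage looks the same as with raw uint64 keys
resNxt[SyncValidatorIndex(7)] = 1
echo resCur.len, " ", resNxt.len
```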
```
        syncCommitteeSlot,
        committeeIdx,
        msg.message.contribution.aggregation_bits):
-     let validatorPubKey = validatorPubKey.loadWithCache.get
+     let validatorPubKey = dag.validatorKey(validatorIndex)
```
The previous use of the cache here was quite important for performance. @mratsim, is dag.validatorKey(index) considered fast enough? (It's based internally on loadValid, which in turn is based on fromBytesKnownOnCurve.)
block_sim is a good benchmark for this, because it simulates the full network traffic around sync committees and their aggregation. The ncli_db bench just replays blocks, so it covers only the sync aggregate verification code.
ImmutableValidatorData2 stores uncompressed pubkeys that are quite fast to load - this was introduced a while ago, giving a decent performance boost: d859bc1. We're using this same approach for attestations and everything else, pretty much - if we're going to update it, we should update it across the board by changing the backend storage for validatorKey. Sync message validation was the outlier here, and loadWithCache is in general a problematic construct.
Posting before/after block_sim results would be convincing evidence that there is no performance regression.
I've added a benchmark for pubkey and signature deserialization. On an 11th gen Intel it's 14 µs per key.
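To put that number in context (a rough, hedged estimate, and it depends on whether the 14 µs is for compressed or uncompressed keys): a sync committee has 512 members, so deserializing every participant's key from scratch would cost on the order of 512 × 14 µs ≈ 7 ms per aggregate - which is why caching, or storing pre-deserialized keys, matters on this path.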
That said, loading the data into cooked pubkeys on startup is likely a valid optimization regardless.
I'm not sure 13440 is relevant here - different data set, multiple flaws in methodology
Yes, if loading uncompressed keys is so fast, loading them on start-up shouldn't be a problem either.
separate PR though, since it affects attestations significantly
@mratsim what are you measuring there btw? compressed or uncompressed deserialization?