Lazy aggregated batch verification #3212
Conversation
A novel optimisation for attestation and sync committee message validation: when batching, we look for signatures of the same message and aggregate these before batch-validating. This results in 10-20% fewer signature verifications on a busy server, leading to a significant reduction in CPU usage.
* increase batch size slightly, which helps finding more aggregates
* add metrics for batch verification efficiency
* use simple `blsVerify` when there is only one signature to verify in the batch, avoiding the RNG
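For illustration, here is a minimal sketch of the grouping step. It is not the PR's actual code: `PubKey`, `Sig`, `SigSet` and `lazyAggregate` are hypothetical stand-ins, and `xor` is only a placeholder for real BLS point aggregation in nim-blscurve.

```nim
import std/tables

type
  # Hypothetical stand-ins for the cooked BLS types carried by the real
  # SignatureSet; actual aggregation is elliptic-curve point addition.
  PubKey = object
    bits: array[48, byte]
  Sig = object
    bits: array[96, byte]
  SigSet = tuple[pubkey: PubKey, root: string, sig: Sig]

proc combine(a: var PubKey, b: PubKey) =
  for i in 0 ..< a.bits.len: a.bits[i] = a.bits[i] xor b.bits[i]

proc combine(a: var Sig, b: Sig) =
  for i in 0 ..< a.bits.len: a.bits[i] = a.bits[i] xor b.bits[i]

proc lazyAggregate(pending: seq[SigSet]): seq[SigSet] =
  ## Fold entries that sign the same message (same signing root) into a
  ## single aggregate, so the batch verifier performs one pairing check
  ## per distinct message instead of one per signature.
  var byRoot = initTable[string, int]() # signing root -> index in result
  for entry in pending:
    if entry.root in byRoot:
      let i = byRoot[entry.root]
      result[i].pubkey.combine(entry.pubkey) # same message: aggregate keys
      result[i].sig.combine(entry.sig)       # ...and signatures
    else:
      byRoot[entry.root] = result.len
      result.add entry

when isMainModule:
  var pending: seq[SigSet]
  pending.add((pubkey: PubKey(), root: "root-A", sig: Sig()))
  pending.add((pubkey: PubKey(), root: "root-A", sig: Sig()))
  pending.add((pubkey: PubKey(), root: "root-B", sig: Sig()))
  echo lazyAggregate(pending).len # 2: the two root-A entries were merged
```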
@@ -74,7 +89,7 @@ const
   # (RNG for blinding and Final Exponentiation)
   # are amortized,
   # but not too big as we need to redo checks one-by-one if one failed.
-  BatchedCryptoSize = 16
+  BatchedCryptoSize = 24
Are there measurements of how often failures that necessitate backtracking/retrying the batching occur? This would become more acute with larger batches.
The only time I've seen them is when we had a bug in the signature verification function. By and large, during an attack, each node will locally discard any faulty signatures so they do not spread - that means that only during a targeted attack against an individual node can there be a failure here.
This shows a mainnet node at batchSize = 24, 32 and 72 respectively.
One issue with the way that validation is set up is that in libp2p, the "next" gossip message from a single node is not processed until the current one has been validated. This decreases concurrency, and therefore also the ability to build meaningful batches. It is especially problematic for lazy aggregation because one node is likely to be sending multiple messages for the same data when dealing with an attestation burst: all attestations on a single subnet will typically have the same message being signed, and nodes listen to particular subnets - cc @Menduist @dryajov
LGTM
I prefer constructing the SignatureSet in-place in the sequence, because returning extra-large datatypes (in this case a tuple) from init tends to create very inefficient code in Nim, with lots of memset(0) and, here, probably a large stack allocation/reclaim.
Nitpick: This high batch number might need to be a compile-time const for minimal-sized testnets? If the committees/attestations never reach 72, we would always have a high 30ms delay.
-    pendingBuffer: seq[SignatureSet]
-    resultsBuffer: seq[Future[BatchResult]]
+    sigsets: seq[SignatureSet]
+    items: seq[BatchItem]

   BatchCrypto* = object
     # Each batch is bounded by BatchedCryptoSize (16) which was chosen:
Comment needs to be updated to 72.
yeah, I'm playing around with different values to strike a good balance - it turns out that in practice, on a really heavily loaded mainnet server, after applying the libp2p fix to validate messages concurrently, the average batch size works out as follows:
- max = 72 gives 28-30 signatures per batch and 2.4-2.6 signatures per aggregate
- max = 36 gives 17-18 signatures per batch and 1.7-1.9 signatures per aggregate
This suggests that some batches are large, but most are much smaller than the max - basically, when max is large, we occasionally use the given space, but not always - it's also likely that the additional space gives aggregation a better chance simply because we collect signatures from different subnets (and it is only per subnet that aggregation can happen).
An earlier version tried using a separate batcher per subnet explicitly, but that doesn't work very well at all - there's not enough traffic on a single subnet to consistently fill up the batches, so overall, it lowers the efficiency of batching.
high 30ms delay.
The "maximum" delay when the batch is not full remains at 10ms, or whenever the async loop wakes up. I suspect increasing this timeout would improve batching and aggregation, but it would indeed hurt the case you're describing.
72 vs 36, mainnet server
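For context, the flush condition discussed above - process the batch when it is full, or after a short deadline when it is not - can be sketched as below. `PendingBatch`, `shouldFlush` and `BatchDeadline` are illustrative names and values, not the actual nimbus-eth2 implementation, which drives this from the chronos event loop.

```nim
import std/times

const
  BatchedCryptoSize = 72 # max batch size from the PR discussion
  BatchDeadline = initDuration(milliseconds = 10) # assumed flush deadline

type
  PendingBatch = object
    sigsets: seq[string] # stand-in for seq[SignatureSet]
    created: Time

proc shouldFlush(b: PendingBatch): bool =
  ## Flush either when the batch is full or once it has waited long enough:
  ## a larger max batch improves aggregation odds, while a non-full batch
  ## still only waits roughly the deadline before being verified.
  b.sigsets.len >= BatchedCryptoSize or
    BatchDeadline <= getTime() - b.created

when isMainModule:
  var batch = PendingBatch(created: getTime())
  batch.sigsets.add "some signature set"
  echo batch.shouldFlush() # false until full or ~10ms have passed
```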
Important to note is that "signatures per batch" != "verifications per batch": when we have 2.5 signatures per aggregate, we're performing 30/2.5 = 12 verifications per batch.
moved and expanded commentary - back to 72: it's ridiculously more efficient in terms of throughput and helps significantly with maintaining a healthy mesh on a slow node, giving it more time to deal with anomalies
      startTick = Moment.now()
      ok =
        if batchSize == 1: blsVerify(batch[].sigsets[0])
        else: batchCrypto.verifier.batchVerify(batch[].sigsets)
It might be better to move this into nim-blscurve, thoughts?
sure, that makes a lot of sense - this PR or separate?
ok, that was easier said than done: nim-blscurve doesn't know about the RNG, so it can't avoid generating randomness in the case of a single signature. Leaving for a potential future refactoring...
-               signature: CookedSig) =
+func init(T: type SignatureSet,
+          pubkey: CookedPubKey, signing_root: Eth2Digest,
+          signature: CookedSig): T =
Returning large value types tends to produce quite inefficient code, and a SigSet is huge.
this should be fixed as of nim-lang/Nim#19115 which landed in the nim 1.2 branch as well
This is the generated code: no memsets, RVO working as it should - I suspect you've been burnt by the nim bug previously:
N_LIB_PRIVATE N_NIMCALL(void, init__7Ffc9cVDV9aLbtP7bpwJ5yNA)(tyObject_PublicKey__wF9cg4i4Nl9cNaHoaQ6NXiVA *pubkey, tyObject_MDigest__law9ct65KplMYBvtmjCQxbw *signing_root, tyObject_Signature__ZPec3zScfAlRHnqARj9a9asg *signature, tyTuple__7PghGCO9ajM9a39c4M2A31RqQ *Result)
{
(*Result).Field0 = (*pubkey);
nimCopyMem((void *)(*Result).Field1, (NIM_CONST void *)(*signing_root).data, sizeof(tyArray__vEOa9c5qaE9ajWxR5R4zwfQg));
(*Result).Field2 = (*signature);
}
ok, I've dug around a bit and couldn't find any instances of this happening with the new function call style - as noted above, this was likely the result of a bug in the nim compiler that has been fixed since.
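To make the pattern under discussion concrete, here is a minimal self-contained sketch of an `init` that constructs and returns a large tuple, relying on (N)RVO. `BigSet` and its field sizes are made up for illustration and are not the actual SignatureSet definition.

```nim
type
  # Hypothetical large value type standing in for SignatureSet, which in
  # nimbus-eth2 is a tuple of (public key, signing root, signature).
  BigSet = tuple[pubkey: array[48, byte],
                 root: array[32, byte],
                 sig: array[96, byte]]

func init(T: type BigSet,
          pubkey: array[48, byte],
          root: array[32, byte],
          sig: array[96, byte]): T =
  # The tuple is constructed directly into the result slot; with RVO the
  # caller's destination is written in place, without a large temporary.
  (pubkey: pubkey, root: root, sig: sig)

when isMainModule:
  var sets: seq[BigSet]
  sets.add BigSet.init(default(array[48, byte]),
                       default(array[32, byte]),
                       default(array[96, byte]))
  echo sets.len # 1
```

The generated C shown above corresponds to this kind of call: the fields are copied straight into the caller-provided Result pointer, with no memset.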
Followup of #3212 to test proper signature verification. Also document possible further optimization based on blst `v0.3.13`.
A novel optimisation for attestation and sync committee message validation: when batching, we look for signatures of the same message and aggregate these before batch-validating. This results in up to 60% fewer signature verifications on a busy server, leading to a significant reduction in CPU usage.
* use simple `blsVerify` when there is only one signature to verify in the batch, avoiding the RNG