
Lazy aggregated batch verification #3212

Merged: 6 commits merged into unstable from lazy-batch-aggregate on Dec 29, 2021

Conversation

@arnetheduck (Member) commented Dec 19, 2021

A novel optimisation for attestation and sync committee message
validation: when batching, we look for signatures of the same message
and aggregate these before batch-validating: this results in up to 60%
fewer signature verifications on a busy server, leading to a significant
reduction in CPU usage.

  • increase batch size slightly which helps finding more aggregates
  • add metrics for batch verification efficiency
  • use simple blsVerify when there is only one signature to verify in
    the batch, avoiding the RNG
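
For illustration, here is a minimal sketch of the grouping step this relies on, using simplified placeholder types rather than the real `SignatureSet`/`Eth2Digest`/`CookedPubKey` types from the codebase:

```nim
# Minimal sketch of the grouping step: collect entries that sign the same
# message so each group can be verified as one aggregate. Placeholder types
# only - the real code works on blscurve SignatureSet entries.
import std/tables

type
  SigItem = object
    signingRoot: string   # stand-in for Eth2Digest
    pubkey: string        # stand-in for CookedPubKey
    signature: string     # stand-in for CookedSig

proc groupBySigningRoot(batch: seq[SigItem]): Table[string, seq[SigItem]] =
  for item in batch:
    result.mgetOrPut(item.signingRoot, @[]).add(item)

when isMainModule:
  let batch = @[
    SigItem(signingRoot: "att-root-a", pubkey: "pk1", signature: "s1"),
    SigItem(signingRoot: "att-root-a", pubkey: "pk2", signature: "s2"),
    SigItem(signingRoot: "att-root-b", pubkey: "pk3", signature: "s3")]
  let groups = groupBySigningRoot(batch)
  echo "verifications needed: ", groups.len, " instead of ", batch.len
```

In the real batcher, each group's public keys and signatures are aggregated with BLS aggregation before batch verification, which is where the reduction in verifications comes from.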

@github-actions bot commented Dec 20, 2021

Unit Test Results

12 files ±0, 763 suites +4, 36m 51s ⏱️ +1m 31s
1 537 tests +2: 1 491 ✔️ +2, 46 💤 ±0, 0 ±0
9 133 runs +7: 9 037 ✔️ +7, 96 💤 ±0, 0 ±0

Results for commit c77150d. ± Comparison against base commit c7be88b.

♻️ This comment has been updated with latest results.

@@ -74,7 +89,7 @@ const
# (RNG for blinding and Final Exponentiation)
# are amortized,
# but not too big as we need to redo checks one-by-one if one failed.
-  BatchedCryptoSize = 16
+  BatchedCryptoSize = 24
Contributor:

Are there measurements of how often verification failures occur that force backtracking/retrying the batch one-by-one? That cost would become more acute with larger batches.

@arnetheduck (Member Author):

The only time I've seen them is when we had a bug in the signature verification function. By and large, during an attack, each node will locally discard any faulty signatures so they do not spread - that means a failure here can only occur during a targeted attack against an individual node.

@arnetheduck (Member Author):

[Screenshot from 2021-12-21 17-42-11]

This shows a mainnet node at batchSize = 24, 32 and 72 respectively.

  • the average number of signatures per aggregate, although not visible due to scale, hovers around 1.15 with 24 and 32, and increases to 1.2 with 72
  • the average number of signatures per batch is naturally lower than the max because of the burstiness of the data in general, meaning some batches are full and some are not, but with the large batch size we can clearly see that the available space stops being used
  • 72 is interesting because that's roughly how many subnets we have (64 attestation subnets + 4 sync committee subnets + some others)
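
For clarity, the two ratios above can be thought of as simple counter quotients, along these lines (illustrative names only, not the actual metrics added by this PR):

```nim
# Illustrative counters only (not the actual metric names added by this PR).
type BatchStats = object
  batches: int         # batch verifications performed
  verifications: int   # pairing checks left after lazy aggregation
  signatures: int      # individual signatures that entered the batcher

func signaturesPerAggregate(s: BatchStats): float =
  s.signatures / s.verifications

func signaturesPerBatch(s: BatchStats): float =
  s.signatures / s.batches

when isMainModule:
  let s = BatchStats(batches: 10, verifications: 250, signatures: 300)
  echo signaturesPerAggregate(s)   # 1.2, as in the 24/32 runs above
  echo signaturesPerBatch(s)       # 30.0
```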

One issue with the way that validation is set up is that in libp2p, the "next" gossip message from a single node is not processed until the current one has been validated. This decreases concurrency, and therefore also the ability to build meaningful batches. It is especially problematic for lazy aggregation, because one node is likely to be sending multiple messages for the same data when dealing with an attestation burst: all attestations on a single subnet will typically have the same message being signed, and nodes listen to particular subnets - cc @Menduist @dryajov

@arnetheduck arnetheduck marked this pull request as draft December 22, 2021 16:26
@arnetheduck (Member Author):

wow, massive difference with concurrent gossip processing: the number of signatures per aggregate goes up from 1.2 to 1.9, which is huge - at 2, we'd be verifying half as many signatures

[Screenshot from 2021-12-22 17-58-51]

@mratsim (Contributor) left a comment:

LGTM

I prefer constructing the SignatureSet in-place in the sequence, because returning extra-large datatypes (and in this case a tuple) from init tends to create very inefficient code in Nim, with lots of memset(0), and here probably a large stack allocation/reclaim.

Nitpick: This high batch number might need to be a compile-time const for minimal-sized testnets? If the committees/attestations never reach 72, we would always have a high 30ms delay.
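
A minimal sketch of that suggestion, assuming Nim's `intdefine` pragma (illustrative only, not code from this PR):

```nim
# Overridable at build time, e.g. `nim c -d:BatchedCryptoSize=16 app.nim`,
# so a minimal-preset testnet with small committees could use a smaller batch.
const BatchedCryptoSize {.intdefine.} = 72

when isMainModule:
  echo "batch limit: ", BatchedCryptoSize
```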

-    pendingBuffer: seq[SignatureSet]
-    resultsBuffer: seq[Future[BatchResult]]
+    sigsets: seq[SignatureSet]
+    items: seq[BatchItem]

BatchCrypto* = object
# Each batch is bounded by BatchedCryptoSize (16) which was chosen:
Contributor:

Comment needs to be updated to 72.

@arnetheduck (Member Author):

yeah, I'm playing around with different values to strike a good balance - it turns out that in practice, on a really heavily loaded mainnet server, after applying the libp2p fix to validate messages concurrently:

  • max = 72 gives 28-30 signatures per batch and 2.4-2.6 signatures per aggregate
  • max = 36 gives 17-18 signatures per batch and 1.7-1.9 signatures per aggregate

This suggests that some batches are large, but most are much smaller than the max - basically, when max is large, we occasionally use the given space, but not always - it's also likely that the additional space gives aggregation a better chance simply because we collect signatures from different subnets (and it is only per subnet that aggregation can happen).

An earlier version tried using a separate batcher per subnet explicitly, but that doesn't work very well at all - there's not enough traffic on a single subnet to consistently fill up the batches, so overall, it lowers the efficiency of batching.

> high 30ms delay

The "maximum" delay when the batch is not full remains at 10ms, or whenever the async loop wakes up. I suspect increasing this timeout would improve batching and aggregation, but it would indeed hurt the case you're describing.

[Screenshots from 2021-12-28 10-55-48 and 2021-12-28 10-55-37]

72 vs 36, mainnet server

@arnetheduck (Member Author):

It's important to note that "signatures per batch" != "verifications per batch": when we have 2.5 signatures per aggregate, we're performing 30/2.5 = 12 verifications per batch.

@arnetheduck (Member Author):

moved and expanded the commentary - back to 72: it's ridiculously more efficient in terms of throughput and helps significantly with maintaining a healthy mesh on a slow node, giving it more time to deal with anomalies

startTick = Moment.now()
ok =
  if batchSize == 1: blsVerify(batch[].sigsets[0])   # single entry: plain verification, no RNG-based blinding
  else: batchCrypto.verifier.batchVerify(batch[].sigsets)
Contributor:

It might be better to move this into nim-blscurve, thoughts?

@arnetheduck (Member Author):

sure, that makes a lot of sense - this PR or separate?

@arnetheduck (Member Author):

ok, that was easier said than done: nim-blscurve doesn't know about the RNG, so it can't avoid generating randomness in the case of a single signature. Leaving it for a potential future refactoring.

-              signature: CookedSig) =
+func init(T: type SignatureSet,
+    pubkey: CookedPubKey, signing_root: Eth2Digest,
+    signature: CookedSig): T =
Contributor:

Returning large value types tends to produce quite inefficient code, and a SigSet is huge.

@arnetheduck (Member Author):

this should be fixed as of nim-lang/Nim#19115 which landed in the nim 1.2 branch as well

@arnetheduck (Member Author):

This is the generated code: no memsets, RVO working as it should - I suspect you've been burnt by the nim bug previously:

N_LIB_PRIVATE N_NIMCALL(void, init__7Ffc9cVDV9aLbtP7bpwJ5yNA)(tyObject_PublicKey__wF9cg4i4Nl9cNaHoaQ6NXiVA *pubkey, tyObject_MDigest__law9ct65KplMYBvtmjCQxbw *signing_root, tyObject_Signature__ZPec3zScfAlRHnqARj9a9asg *signature, tyTuple__7PghGCO9ajM9a39c4M2A31RqQ *Result)
{
	(*Result).Field0 = (*pubkey);
	nimCopyMem((void *)(*Result).Field1, (NIM_CONST void *)(*signing_root).data, sizeof(tyArray__vEOa9c5qaE9ajWxR5R4zwfQg));
	(*Result).Field2 = (*signature);
}
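
For reference, a self-contained sketch (with a simplified placeholder type, not the real SignatureSet) contrasting the two construction styles discussed in this thread: returning from an init that relies on RVO versus growing the sequence and filling the new slot in place:

```nim
# Placeholder type standing in for a large value type such as SignatureSet.
type
  BigSet = object
    data: array[256, byte]

func init(T: type BigSet, b: byte): T =
  # With RVO, `result` is constructed directly in the caller-provided
  # location - no memset of a temporary, as the generated C above shows.
  result.data[0] = b

proc addInPlace(s: var seq[BigSet], b: byte) =
  # The alternative: grow the sequence first, then fill the new slot in place.
  s.setLen(s.len + 1)
  s[^1].data[0] = b

when isMainModule:
  var sets: seq[BigSet]
  sets.add(BigSet.init(1))   # style used in this PR
  sets.addInPlace(2)         # in-place style preferred in the review comment
  echo sets.len
```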

@arnetheduck arnetheduck marked this pull request as ready for review December 29, 2021 14:18
@arnetheduck (Member Author):

> tends to create very inefficient code

ok, I've dug around a bit and couldn't find any instances of this happening with the new function call style - as noted above, this was likely the result of a bug in the nim compiler that has been fixed since.

@arnetheduck arnetheduck merged commit 6b60a77 into unstable Dec 29, 2021
@arnetheduck arnetheduck deleted the lazy-batch-aggregate branch December 29, 2021 14:28
etan-status added a commit that referenced this pull request Jul 26, 2024
Followup of #3212 to test proper signature verification. Also document
possible further optimization based on blst `v0.3.13`.
tersec pushed a commit that referenced this pull request Jul 26, 2024
Followup of #3212 to test proper signature verification. Also document
possible further optimization based on blst `v0.3.13`.