Streamline Nova allocations (Arecibo backport) #277
Conversation
Hi @huitseeker, thanks for the PR!
From these lines, it seems these preallocations only save 4 milliseconds for the prover when the step circuit is around a million gates in size. That is only a 0.26% improvement. Am I reading this correctly?
This is correct and expected. This benchmark does not stress the allocator. This work was originally developed as a bundled package in lurk-lang/arecibo#118 and tested in lurk-lab/lurk-beta#890.
At what circuit size is this PR going to provide benefits? The PR proposes a fairly large, and arguably complicated-to-maintain, change to the way memory is managed. I think it would be good to rigorously understand the benefits it provides. Specifically, for what circuits and what step circuit sizes is this beneficial?
Let me interpret the Lurk benchmarks shown here (you may need to click the disclosure triangle to see them; I will reproduce them below):
So, to synthesize: on Lurk's fibonacci workload, the answer to your question is that for 1.2M constraints iterated 17 times, we see an 8.33% improvement. Note that these metrics are a bit tricky to interpret, because the first step is cheaper and therefore biases toward better performance on shorter computations. It's possible I presented some of this information wrong, but that is the basic answer. @winston-h-zhang can fill in or fix any details I omitted or got wrong.
For more signal on memory, here is the top-level memory profiling of the last round of the minroot example (65536 it/step). As you can see, the overall memory usage is far from taxing for a modern machine, but it is sufficient to see each of the nine steps of folding, which each create a short-lived allocation on main (and do not on the PR). As for extrapolating how this leads to a degradation in performance through memory fragmentation in larger proofs, here's what happens if I zoom out of the first picture (same dataset, on main) to see all 7 rounds of Minroot:
To elaborate on @porcuquine's and @huitseeker's points, I've provided some more benchmarks that give context on why we've pursued this new memory model, along with a breakdown of the improvements each PR provides.

Context

In theory, having many large reallocations of vectors puts a lot of load on the memory allocator, due to factors like fragmentation. Thus, we should try to avoid reallocating memory whenever it is possible to reuse already available space. Concretely, the following benchmark compares running the proving pipeline under (1) ideal allocator conditions and (2) bad allocator conditions.
In this context, Lurk offers two operational scenarios: "fresh" and "load." In the "fresh" scenario, public parameters are regenerated for each run of the prover, while in the "load" scenario, these parameters are read from a persistent source, like disk storage. The choice between "fresh" and "load" does not semantically alter Lurk's functioning. However, a significant performance regression has been observed, seemingly without a clear cause. This issue illustrates a broader potential problem: changes in system variables that do not semantically impact the proving process can still inadvertently trigger regressions. This is particularly problematic given the current instability of the memory pipeline, where even minor alterations might disrupt the allocator and cause regressions in a way that is hard to detect and debug. A key distinction between the two scenarios is the differing levels of memory pressure they exert.

Breakdown of Improvements

The following 3 benchmarks show how each PR further reduces the regression we observe, until there is no regression left. Note that each subsequent benchmark includes all of the previous improvements.
Results

Significant improvements are not anticipated when comparing against our "fresh" scenario, as that comparison is made against a scenario that operates without memory pressure and consequently does not experience the drawbacks of an architecture struggling to cope with such pressure. However, the significant improvement compared to our "load" scenario is strong evidence that the memory pipeline is much more stable.
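To make the reuse pattern concrete, here is a minimal, self-contained Rust sketch (not code from this PR) contrasting per-step allocation with reuse of a long-lived buffer. `Scalar`, `WITNESS_LEN`, and `synthesize_into` are hypothetical stand-ins for the real witness types and synthesis routines in Nova/Arecibo.

```rust
// Minimal sketch, assuming hypothetical names: the allocation pattern being
// avoided vs. the reuse pattern being adopted.
type Scalar = u64;
const WITNESS_LEN: usize = 1 << 20;

// Hypothetical step-circuit synthesis that writes its witness into a slice.
fn synthesize_into(step: usize, out: &mut [Scalar]) {
    for (i, w) in out.iter_mut().enumerate() {
        *w = (step as u64).wrapping_mul(i as u64); // placeholder work
    }
}

// Before: every folding step allocates (and later frees) a large, short-lived vector.
fn prove_allocating(steps: usize) {
    for step in 0..steps {
        let mut witness: Vec<Scalar> = vec![0; WITNESS_LEN]; // fresh allocation per step
        synthesize_into(step, &mut witness);
        // `witness` is dropped here; over many steps this churns the allocator.
    }
}

// After: one long-lived buffer is allocated up front and reused across steps.
fn prove_reusing(steps: usize) {
    let mut witness: Vec<Scalar> = vec![0; WITNESS_LEN]; // single allocation
    for step in 0..steps {
        synthesize_into(step, &mut witness); // overwrite in place, no reallocation
    }
}

fn main() {
    prove_allocating(3);
    prove_reusing(3);
}
```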
Thanks for adding all the context! This is a very important optimization in the big picture, and we would like to eventually get this feature merged. One thing on which we need better clarity is how this kind of optimization will impact future code changes. For example, in some instances it inherently requires the caller to allocate and manage memory on behalf of the callee. Have you considered making this trait-based? For example, in the current PR (assuming I understand the full code correctly), … The reason to bring up this design is that, in the future, we might abstract away …
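For illustration only, here is one possible shape of the trait-based design suggested above; this is a sketch under assumed names (`Scalar`, `ScratchSpace`, `VecScratch`, `fold_step`), not the Nova/Arecibo API or the PR's actual code.

```rust
// Hedged sketch: a trait that lets the callee borrow reusable scratch space,
// so the caller does not have to manually allocate and hand over buffers.
type Scalar = u64;

/// Reusable scratch space owned by whoever implements the trait.
trait ScratchSpace {
    /// Return a mutable witness buffer of at least `len` elements,
    /// growing the underlying storage only when needed.
    fn witness_buffer(&mut self, len: usize) -> &mut [Scalar];
}

struct VecScratch {
    witness: Vec<Scalar>,
}

impl ScratchSpace for VecScratch {
    fn witness_buffer(&mut self, len: usize) -> &mut [Scalar] {
        if self.witness.len() < len {
            self.witness.resize(len, 0); // grows at most a few times, then is reused
        }
        &mut self.witness[..len]
    }
}

// A caller (e.g. a folding loop) borrows scratch space per step instead of
// allocating fresh vectors; the concrete allocation strategy can change behind
// the trait without touching call sites.
fn fold_step(scratch: &mut impl ScratchSpace, step: usize, len: usize) {
    let buf = scratch.witness_buffer(len);
    for (i, w) in buf.iter_mut().enumerate() {
        *w = (step as u64).wrapping_mul(i as u64); // placeholder synthesis
    }
}

fn main() {
    let mut scratch = VecScratch { witness: Vec::new() };
    for step in 0..3 {
        fold_step(&mut scratch, step, 1 << 10);
    }
}
```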
force-pushed the branch from 801f4ee to 0081109
* remove large vector allocations
* add suggestions
- Refactored the `prove_mut` function in `nifs.rs`
- Removed the necessity of absorbing U1 in the `absorb_in_ro` function, reducing redundant steps.
force-pushed the branch from 0081109 to 5c24dcb
Given the lack of activity on this PR and its significant divergence from main, I'm closing it for now. Please reopen if you wish to contribute.
What
This backports the following PRs in Arecibo:
Why
These PRs streamline the memory allocations occurring in performance-sensitive parts of Nova by applying a few simple techniques:
Outcome
Each Arecibo PR (each commit here) has been benchmarked on Lurk-sized circuits to show its impact on large witnesses / instances, and each shows a 5-8% improvement in proving latency.
Local benchmarks (small circuits, no memory pressure, high variance)
critcmp results (meant to demonstrate this does not make things worse from a CPU PoV)
h/t @winston-h-zhang 👏