Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Streamline Nova allocations (Arecibo backport) #277

Closed
wants to merge 4 commits into from

Conversation

huitseeker
Copy link
Contributor

@huitseeker huitseeker commented Dec 5, 2023

What

This backports the following PRs in Arecibo:

Why

These PRs streamline the memory allocations occuring in performance-sensitive parts of Nova by applying a few simple techniques:

  • pre-allocating known-size vectors,
  • when a known-size amount of data will be allocated on each iteration of a tight loop, reuse the allocation from the prior iteration,
  • read arguments by reference in few-argument matrix operations rather than copying those arguments to a single contiguous allocation

Outcome

Each PR (in Arecibo: each commit here) has been benchmarked on Lurk-size circuits showing its impact on large witnesses / instances, and each shows a 5-8% improvement on proving latency.

Local benchmarks (small circuits, no memory pressure, high variance)

critcmp results (meant to demonstrate this does not make things worse from a CPU PoV)
group                                                        main                                    nova_allocations
-----                                                        ----                                    ----------------
CompressedSNARK-Commitments-StepCircuitSize-0/Prove          1.00       7.3±0.08s             1.00       7.3±0.12s
CompressedSNARK-Commitments-StepCircuitSize-0/Verify         1.00    188.3±5.29ms             1.02    191.5±4.17ms
CompressedSNARK-Commitments-StepCircuitSize-121253/Prove     1.01      23.3±0.58s             1.00      23.1±0.20s
CompressedSNARK-Commitments-StepCircuitSize-121253/Verify    1.05   728.1±60.72ms             1.00   694.4±20.50ms
CompressedSNARK-Commitments-StepCircuitSize-22949/Prove      1.00      11.6±0.16s             1.00      11.6±0.16s
CompressedSNARK-Commitments-StepCircuitSize-22949/Verify     1.07   333.4±31.70ms             1.00   312.0±10.16ms
CompressedSNARK-Commitments-StepCircuitSize-252325/Prove     1.00      43.4±0.87s             1.01      43.9±0.78s
CompressedSNARK-Commitments-StepCircuitSize-252325/Verify    1.03  1213.0±107.21ms            1.00  1173.6±48.71ms
CompressedSNARK-Commitments-StepCircuitSize-55717/Prove      1.00      19.7±0.35s             1.00      19.8±0.43s
CompressedSNARK-Commitments-StepCircuitSize-55717/Verify     1.00   599.4±29.04ms             1.00   599.9±28.39ms
CompressedSNARK-Commitments-StepCircuitSize-6565/Prove       1.00      10.8±0.13s             1.00      10.7±0.09s
CompressedSNARK-Commitments-StepCircuitSize-6565/Verify      1.04   302.6±10.64ms             1.00   291.6±10.66ms
CompressedSNARK-StepCircuitSize-0/Prove                      1.01   799.1±26.03ms             1.00   791.7±10.32ms
CompressedSNARK-StepCircuitSize-0/Verify                     1.02     31.3±1.15ms             1.00     30.7±0.61ms
CompressedSNARK-StepCircuitSize-1038757/Prove                1.00      48.6±0.76s             1.00      48.4±1.05s
CompressedSNARK-StepCircuitSize-1038757/Verify               1.00  1599.8±38.46ms             1.01  1622.6±36.11ms
CompressedSNARK-StepCircuitSize-121253/Prove                 1.00       6.5±0.27s             1.01       6.6±0.15s
CompressedSNARK-StepCircuitSize-121253/Verify                1.02   210.2±12.48ms             1.00   206.0±12.69ms
CompressedSNARK-StepCircuitSize-22949/Prove                  1.00  1865.4±41.89ms             1.01  1879.3±40.81ms
CompressedSNARK-StepCircuitSize-22949/Verify                 1.00     70.5±3.44ms             1.00     70.3±2.00ms
CompressedSNARK-StepCircuitSize-252325/Prove                 1.00      12.7±0.36s             1.01      12.8±0.37s
CompressedSNARK-StepCircuitSize-252325/Verify                1.00   438.9±21.82ms             1.04   455.3±23.93ms
CompressedSNARK-StepCircuitSize-514469/Prove                 1.00      24.7±0.63s             1.00      24.6±0.63s
CompressedSNARK-StepCircuitSize-514469/Verify                1.01   701.7±30.84ms             1.00   697.0±31.46ms
CompressedSNARK-StepCircuitSize-55717/Prove                  1.01       3.5±0.12s             1.00       3.5±0.11s
CompressedSNARK-StepCircuitSize-55717/Verify                 1.00    123.0±4.74ms             1.02    124.9±7.61ms
CompressedSNARK-StepCircuitSize-6565/Prove                   1.00  1147.2±13.65ms             1.03  1180.6±52.56ms
CompressedSNARK-StepCircuitSize-6565/Verify                  1.00     46.8±1.97ms             1.01     47.2±1.51ms
RecursiveSNARK-StepCircuitSize-0/Prove                       1.00     68.6±4.89ms             1.02    69.7±12.03ms
RecursiveSNARK-StepCircuitSize-0/Verify                      1.01     28.2±0.79ms             1.00     28.0±0.57ms
RecursiveSNARK-StepCircuitSize-1038757/Prove                 1.00  1504.2±82.11ms             1.00  1500.8±37.38ms
RecursiveSNARK-StepCircuitSize-1038757/Verify                1.00  1390.6±33.57ms             1.02  1423.2±81.45ms
RecursiveSNARK-StepCircuitSize-121253/Prove                  1.00    224.4±5.93ms             1.04   233.0±11.44ms
RecursiveSNARK-StepCircuitSize-121253/Verify                 1.01    177.0±7.56ms             1.00    176.0±9.24ms
RecursiveSNARK-StepCircuitSize-22949/Prove                   1.00    107.8±7.98ms             1.01   108.7±12.36ms
RecursiveSNARK-StepCircuitSize-22949/Verify                  1.00     59.1±2.03ms             1.00     59.4±0.88ms
RecursiveSNARK-StepCircuitSize-252325/Prove                  1.00   393.3±15.95ms             1.04   407.6±16.87ms
RecursiveSNARK-StepCircuitSize-252325/Verify                 1.00   343.9±25.63ms             1.01   347.8±15.34ms
RecursiveSNARK-StepCircuitSize-514469/Prove                  1.00   781.9±21.67ms             1.02   793.9±25.06ms
RecursiveSNARK-StepCircuitSize-514469/Verify                 1.02   718.6±28.72ms             1.00   706.0±30.34ms
RecursiveSNARK-StepCircuitSize-55717/Prove                   1.00    151.4±7.48ms             1.03   155.5±13.56ms
RecursiveSNARK-StepCircuitSize-55717/Verify                  1.00     92.6±2.94ms             1.08    100.3±7.13ms
RecursiveSNARK-StepCircuitSize-6565/Prove                    1.01     77.4±4.27ms             1.00     76.6±4.25ms
RecursiveSNARK-StepCircuitSize-6565/Verify                   1.00     36.3±0.65ms             1.02     37.0±1.56ms

h/t @winston-h-zhang 👏

@srinathsetty
Copy link
Collaborator

Hi @huitseeker thanks for the PR!

RecursiveSNARK-StepCircuitSize-0/Prove 1.00 68.6±4.89ms 1.02 69.7±12.03ms
RecursiveSNARK-StepCircuitSize-6565/Prove 1.01 77.4±4.27ms 1.00 76.6±4.25ms
RecursiveSNARK-StepCircuitSize-22949/Prove 1.00 107.8±7.98ms 1.01 108.7±12.36ms
RecursiveSNARK-StepCircuitSize-55717/Prove 1.00 151.4±7.48ms 1.03 155.5±13.56ms
RecursiveSNARK-StepCircuitSize-121253/Prove 1.00 224.4±5.93ms 1.04 233.0±11.44ms
RecursiveSNARK-StepCircuitSize-252325/Prove 1.00 393.3±15.95ms 1.04 407.6±16.87ms
RecursiveSNARK-StepCircuitSize-514469/Prove 1.00 781.9±21.67ms 1.02 793.9±25.06ms
RecursiveSNARK-StepCircuitSize-1038757/Prove 1.00 1504.2±82.11ms 1.00 1500.8±37.38ms

From these lines, it seems like, these preallocations only save 4 milliseconds for the prover when the step circuit is of size a million gates. This is only 0.26% improvement. Am I reading this correctly?

@huitseeker
Copy link
Contributor Author

huitseeker commented Dec 6, 2023

From these lines, it seems like, these preallocations only save 4 milliseconds for the prover when the step circuit is of size a million gates. This is only 0.26% improvement. Am I reading this correctly?

This is correct and expected. This benchmark does not stress the allocator.

This work was originally developed as a bundled package in lurk-lang/arecibo#118 and tested in lurk-lab/lurk-beta#890.
What you're seeing here is broken-down changes.
In particular, for lurk-lang/arecibo#139 we have benches at lurk-lab/lurk-beta#923.

@srinathsetty
Copy link
Collaborator

From these lines, it seems like, these preallocations only save 4 milliseconds for the prover when the step circuit is of size a million gates. This is only 0.26% improvement. Am I reading this correctly?

This is correct and expected. This benchmark does not stress the allocator.

At what circuit size is this PR going to provide benefits? The PR proposes a fairly large, arguably somewhat complicated-to-maintain, change to the way memory is managed. I think it would be good to rigorously understand the benefits it provides. Specifically, for what circuits and what step circuit sizes is this beneficial?

@porcuquine
Copy link
Contributor

From these lines, it seems like, these preallocations only save 4 milliseconds for the prover when the step circuit is of size a million gates. This is only 0.26% improvement. Am I reading this correctly?

This is correct and expected. This benchmark does not stress the allocator.

At what circuit size is this PR going to provide benefits? The PR proposes a fairly large, arguably somewhat complicated-to-maintain, change to the way memory is managed. I think it would be good to rigorously understand the benefits it provides. Specifically, for what circuits and what step circuit sizes is this beneficial?

Let me interpret the Lurk benchmarks shown here (you may need to click the disclosure triangle to see them. I will reproduce below):
lurk-lab/lurk-beta#923 (comment)

Test Base PR %
LEM Fibonacci Prove - rc = 100/fib/num-100 4.8±0.03s 4.4±0.03s -8.33%
LEM Fibonacci Prove - rc = 100/fib/num-200 9.7±0.33s 9.2±0.08s -5.15%
LEM Fibonacci Prove - rc = 600/fib/num-100 3.8±0.01s N/A N/A
LEM Fibonacci Prove - rc = 600/fib/num-200 8.4±0.10s N/A N/A

The rc referenced is 'reduction count'. That is the number of inner Lurk circuits bundled in the application circuit. The Lurk circuit is about 12,000 constraints, so rc=100 is about 1.2M constraints. The fibonacci part (num-100, num-200) determines how large the calculation is, and we can derive the number of folding steps from that. In these cases, the calculation uses 11 and 33 steps respectively.

So, to synthesize, on Lurk's fibonacci workload, the answer to your question is:

For 1.2M constraints iterated 17 times, we see an 8.33% improvement.
For 1.2M constraints iterated 33 times, we see a 5.15% improvement.

Note that these metrics are a bit tricky to interpret because the first step is cheaper and therefore biases toward better performance on shorter computations.

It's possible I presented some of this info wrong, but that is the basic answer. @winston-h-zhang can fill in or fix any details I omitted or got wrong.

@huitseeker
Copy link
Contributor Author

huitseeker commented Dec 7, 2023

For more signal on memory, here is the top-level memory profiling of the last round of the minroot example (65536 it/step) using bytehound. First is on main, second is the PR.

On main On PR

As you can see, the overall memory usage is far from taxing for a modern machine. But it's sufficient to see each of the nine steps of folding that each create their short-lived allocation on main (and which do not on the PR).

As to the extrapolation of how this leads to a degradation in performance through memory fragmentation in larger proofs, here's what happens if if zoom out of the first picture (same dataset, on main) to see all 7 rounds of Minroot :
Screenshot 2023-12-06 at 8 53 22 PM
As you can see, the memory usage pattern scales linearly with the number of constraints. Another way to interpret this is that, for a fixed number of constraints, the performance impact of this PR (through fragmentation) will depend on how memory-pressured the machine is.

@winston-h-zhang
Copy link
Contributor

winston-h-zhang commented Dec 7, 2023

To elaborate more on @porcuquine and @huitseeker, I've provided some more benchmarks that give more context to why we've pursued this new memory model, and the breakdowns of improvements each PR provides.

Context

In theory, having many large reallocations of vectors puts a lot of load on the memory allocator, due to factors like fragmentation. Thus, we should try to avoid reallocating memory whenever it is possible to reuse already available space. Concretely, the following benchmark compares running the proving pipeline under (1) ideal allocator conditions and (2) bad allocator conditions.

rc Test Base (fresh) Base (load) %
100 Fibonacci(100) 6.60s 7.47s 113%
Fibonacci(200) 14.58s 14.89s 102%
600 Fibonacci(100) 4.33s 6.34s 146%
Fibonacci(200) 10.24s 14.65s 143%

In this context, Lurk offers two operational scenarios: "fresh" and "load." In the "fresh" scenario, public parameters are regenerated for each run of the prover, while in the "load" scenario, these parameters are read from a persistent source, like disk storage. The choice between "fresh" and "load" does not semantically alter Lurk's functioning. However, a significant performance regression has been observed, seemingly without a clear cause. This issue illustrates a broader potential problem: changes in system variables that do not semantically impact the proving process could still inadvertently trigger regressions. This is particularly problematic due to the current instability of the memory pipeline, where even minor alterations might disrupt the allocator and cause regressions in a way that is hard to detect and debug. A key distinction between the two scenarios is the differing levels of memory pressure they exert.

Breakdown of Improvements

The following 3 benchmarks show how each PR additionally improves the regression we observe, until there is no more regression. Note, each next benchmark includes all the previous improvements.

lurk-lang/arecibo#139

rc Test Base (load) remove r1cs clones %
100 Fibonacci(100) 7.47s 6.60s 88%
Fibonacci(200) 14.89s 13.16s 88%
600 Fibonacci(100) 6.34s 5.58s 88%
Fibonacci(200) 14.65s 12.36s 84%

lurk-lang/arecibo#143

rc Test Base (load) preallocated witness + previous %
100 Fibonacci(100) 7.47s 6.54s 87%
Fibonacci(200) 14.89s 12.90s 86%
600 Fibonacci(100) 6.34s 5.04s 79%
Fibonacci(200) 14.65s 11.38s 77%

lurk-lang/arecibo#144

rc Test Base (load) large vector allocations + previous %
100 Fibonacci(100) 7.47s 5.71s 76%
Fibonacci(200) 14.89s 11.32s 76%
600 Fibonacci(100) 6.34s 4.98s 78%
Fibonacci(200) 14.65s 10.50s 71%

Results

Significant improvements are not anticipated when comparing with our "fresh" scenario, as the comparison is made against a scenario that operates without memory pressure. Consequently, this scenario does not experience the drawbacks of an architecture struggling to cope with such pressure. However, the significant improvement compared to our "load" scenario is strong evidence that the memory pipeline is much more stable.

@srinathsetty
Copy link
Collaborator

Thanks for adding all the context! This is a very important optimization in the big picture. We would like to eventually get this feature merged.

One thing that we need to get better clarity is how this kind of optimization will impact future code changes. For example, in some instances, this inherently requires the caller to allocate and manage memory of the callee.

Have you considered making this to be based on traits? For example, in the current PR (assuming I understand the full code correctly), pub struct ResourceBuffer<E: Engine> (defined in lib.rs) seems to be very much tied to the particulars of nifs.rs. Is there a pattern in which we can have nifs.rs define certain types that lib.rs can use as a black box (with certain trait methods)?

The reason to bring up this design is in the future, we might abstract away nifs.fs with a trait so that lib.rs knows nothing about the internal details.

winston-h-zhang and others added 4 commits May 2, 2024 03:59
* remove large vector allocations

* add suggestions
- Refactored the 'prove_mut' function in 'nifs.rs'
- Removed the necessity of absorbing U1 in the `absorb_in_ro` function, reducing redundant steps.
@srinathsetty
Copy link
Collaborator

Given lack of activity on this PR and significant divergence from main, closing it for now. Please reopen if you wish to contribute.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants