Streamline Nova allocations (Arecibo backport) #277
Conversation
Hi @huitseeker, thanks for the PR!
From these lines, it seems these preallocations only save 4 milliseconds for the prover when the step circuit is around a million gates in size. That is only a 0.26% improvement. Am I reading this correctly?
This is correct and expected. This benchmark does not stress the allocator. This work was originally developed as a bundled package in lurk-lang/arecibo#118 and tested in lurk-lab/lurk-beta#890.
At what circuit size is this PR going to provide benefits? The PR proposes a fairly large, and arguably complicated-to-maintain, change to the way memory is managed. I think it would be good to rigorously understand the benefits it provides. Specifically, for what circuits and what step circuit sizes is this beneficial?
Let me interpret the Lurk benchmarks shown here (you may need to click the disclosure triangle to see them; I will reproduce them below):
So, to synthesize: on Lurk's fibonacci workload, the answer to your question is that for 1.2M constraints iterated 17 times, we see an 8.33% improvement. Note that these metrics are a bit tricky to interpret, because the first step is cheaper and therefore biases toward better performance on shorter computations. It's possible I presented some of this information wrong, but that is the basic answer. @winston-h-zhang can fill in or fix any details I omitted or got wrong.
For more signal on memory, here is the top-level memory profiling of the last round of the minroot example (65536 it/step). As you can see, the overall memory usage is far from taxing for a modern machine, but it is sufficient to see each of the nine steps of folding, which each create a short-lived allocation on main (and do not on the PR). As for extrapolating how this leads to a degradation in performance through memory fragmentation in larger proofs, here's what happens if I zoom out of the first picture (same dataset, on main) to see all 7 rounds of Minroot:
To elaborate on @porcuquine's and @huitseeker's points, I've provided some more benchmarks that give context on why we've pursued this new memory model, along with a breakdown of the improvements each PR provides.

Context

In theory, having many large reallocations of vectors puts a lot of load on the memory allocator, due to factors like fragmentation. Thus, we should try to avoid reallocating memory whenever it is possible to reuse already available space. Concretely, the following benchmark compares running the proving pipeline under (1) ideal allocator conditions and (2) bad allocator conditions.
In this context, Lurk offers two operational scenarios: "fresh" and "load." In the "fresh" scenario, public parameters are regenerated for each run of the prover, while in the "load" scenario, these parameters are read from a persistent source, like disk storage. The choice between "fresh" and "load" does not semantically alter Lurk's functioning. However, a significant performance regression has been observed, seemingly without a clear cause. This issue illustrates a broader potential problem: changes in system variables that do not semantically impact the proving process can still inadvertently trigger regressions. This is particularly problematic given the current instability of the memory pipeline, where even minor alterations might disrupt the allocator and cause regressions in a way that is hard to detect and debug. A key distinction between the two scenarios is the differing levels of memory pressure they exert.

Breakdown of Improvements

The following 3 benchmarks show how each PR further reduces the regression we observe, until there is no regression left. Note that each subsequent benchmark includes all of the previous improvements.
Results

Significant improvements are not anticipated when comparing against our "fresh" scenario, as that comparison is made against a scenario that operates without memory pressure and consequently does not experience the drawbacks of an architecture struggling to cope with such pressure. However, the significant improvement compared to our "load" scenario is strong evidence that the memory pipeline is much more stable.
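To make the reuse pattern concrete, here is a minimal, self-contained Rust sketch (not code from this PR) contrasting per-step allocation with reuse of a long-lived buffer. `Scalar`, `WITNESS_LEN`, and `synthesize_into` are hypothetical stand-ins for the real witness types and synthesis routines in Nova/Arecibo.

```rust
// Minimal sketch, assuming hypothetical names: the allocation pattern being
// avoided vs. the reuse pattern being adopted.
type Scalar = u64;
const WITNESS_LEN: usize = 1 << 20;

// Hypothetical step-circuit synthesis that writes its witness into a slice.
fn synthesize_into(step: usize, out: &mut [Scalar]) {
    for (i, w) in out.iter_mut().enumerate() {
        *w = (step as u64).wrapping_mul(i as u64); // placeholder work
    }
}

// Before: every folding step allocates (and later frees) a large, short-lived vector.
fn prove_allocating(steps: usize) {
    for step in 0..steps {
        let mut witness: Vec<Scalar> = vec![0; WITNESS_LEN]; // fresh allocation per step
        synthesize_into(step, &mut witness);
        // `witness` is dropped here; over many steps this churns the allocator.
    }
}

// After: one long-lived buffer is allocated up front and reused across steps.
fn prove_reusing(steps: usize) {
    let mut witness: Vec<Scalar> = vec![0; WITNESS_LEN]; // single allocation
    for step in 0..steps {
        synthesize_into(step, &mut witness); // overwrite in place, no reallocation
    }
}

fn main() {
    prove_allocating(3);
    prove_reusing(3);
}
```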
Thanks for adding all the context! This is a very important optimization in the big picture, and we would like to eventually get this feature merged. One thing on which we need better clarity is how this kind of optimization will impact future code changes. For example, in some instances it inherently requires the caller to allocate and manage memory on behalf of the callee. Have you considered making this trait-based? For example, in the current PR (assuming I understand the full code correctly), … The reason to bring up this design is that, in the future, we might abstract away …
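For illustration only, here is one possible shape of the trait-based design suggested above; this is a sketch under assumed names (`Scalar`, `ScratchSpace`, `VecScratch`, `fold_step`), not the Nova/Arecibo API or the PR's actual code.

```rust
// Hedged sketch: a trait that lets the callee borrow reusable scratch space,
// so the caller does not have to manually allocate and hand over buffers.
type Scalar = u64;

/// Reusable scratch space owned by whoever implements the trait.
trait ScratchSpace {
    /// Return a mutable witness buffer of at least `len` elements,
    /// growing the underlying storage only when needed.
    fn witness_buffer(&mut self, len: usize) -> &mut [Scalar];
}

struct VecScratch {
    witness: Vec<Scalar>,
}

impl ScratchSpace for VecScratch {
    fn witness_buffer(&mut self, len: usize) -> &mut [Scalar] {
        if self.witness.len() < len {
            self.witness.resize(len, 0); // grows at most a few times, then is reused
        }
        &mut self.witness[..len]
    }
}

// A caller (e.g. a folding loop) borrows scratch space per step instead of
// allocating fresh vectors; the concrete allocation strategy can change behind
// the trait without touching call sites.
fn fold_step(scratch: &mut impl ScratchSpace, step: usize, len: usize) {
    let buf = scratch.witness_buffer(len);
    for (i, w) in buf.iter_mut().enumerate() {
        *w = (step as u64).wrapping_mul(i as u64); // placeholder synthesis
    }
}

fn main() {
    let mut scratch = VecScratch { witness: Vec::new() };
    for step in 0..3 {
        fold_step(&mut scratch, step, 1 << 10);
    }
}
```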
force-pushed the branch from 801f4ee to 0081109
* remove large vector allocations
* add suggestions
- Refactored the `prove_mut` function in `nifs.rs`
- Removed the necessity of absorbing U1 in the `absorb_in_ro` function, reducing redundant steps.
force-pushed the branch from 0081109 to 5c24dcb
Given the lack of activity on this PR and its significant divergence from main, I'm closing it for now. Please reopen if you wish to contribute.
What
This backports the following PRs in Arecibo:
Why
These PRs streamline the memory allocations occurring in performance-sensitive parts of Nova by applying a few simple techniques:
Outcome
Each Arecibo PR (each commit here) has been benchmarked on Lurk-sized circuits to show its impact on large witnesses / instances, and each shows a 5-8% improvement in proving latency.
Local benchmarks (small circuits, no memory pressure, high variance)
critcmp results (meant to demonstrate this does not make things worse from a CPU PoV)
h/t @winston-h-zhang 👏