Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

read-write lookup offline-memory-checking logUp-based implementation in (Super)Nova R1CS #48

Draft
wants to merge 6 commits into
base: dev
Choose a base branch
from

Conversation

hero78119
Copy link
Contributor

@hero78119 hero78119 commented Sep 15, 2023

Implementation idea derived from Lookup Argument in Nova tricks: permutation checking without re-ordering

This idea as a demo, aiming to bridge the last mile of Nova R1CS to handle below scenarios (well known of high cost of constraints in R1CS)

  • read/write lookup table cross folding steps
  • memory execution trace proof in VM
  • bitwise operations
  • ...

in pretty low cost (comparing with Merkle Tree approach, maybe SOTA?) per incremental computation (folding step): few field arithmetics + one random oracle hash for challenge, agnostic to lookup table size |T|! At the end just invoke an product sumcheck which can be integrated in nature to CompressionSnark with other claims together.

The test case show heapify creation on an array to build minheap. Array are representative as table support read/write. There will be O(N) round of foldings, where N is the size of array. In folding step i, there will be 3 read (parent node, 2 child nodes) and 4 write (update nodes). This case simply show that a table write in folding step i can be read later in folding step j for j > i.

R1CS constraints cost regarding to rw lookup

  • poseidon: ~992 constraints => for compute challenge in step circuit
  • others constrains for r/w lookup consistency: ~ log_2(number of read write lookup in 1 step). For example, 2^10 (1024) read write cost ~ 10 constraints
  • extra variables in Nova state: 5 (rolling challenge, challenge, rolling_READ, rolling_WRITE, read write counter)

    read write counter can be omit in read-only table

Updates 2024-01-12: more up to date detail see comment below

Table read/write are wrapped into a gadget so it can be easily integrated in other applications. Gadget are supposed to be used in non-deterministic primary circuit, so there are other changes involved, basically set primary circuit as mutable reference so the lookup gadget can be updated and feeding non-deterministic aux witness easily.

PR still under drafted with following pending tasks (excuse for bad coding style in early stage :P )

  • pre-process global challenge
  • product sumcheck in CompressSnark
  • rollback to immutable step circuit design
  • debug: check why compress commitment in prove/decompress in verify got different result
  • thorough soundness check and documentation

Will keep ongoing work, and any review and feedback is welcome and appreciated :)

@hero78119 hero78119 changed the title read-write lookup implementation in (Super)Nova R1CS read-write lookup implementation via permutation fingerprint in (Super)Nova R1CS Sep 15, 2023
@hero78119 hero78119 marked this pull request as draft September 15, 2023 16:19
@hero78119 hero78119 force-pushed the nova_lookup branch 3 times, most recently from 9787e88 to 80331a8 Compare September 18, 2023 12:36
@hero78119 hero78119 force-pushed the nova_lookup branch 4 times, most recently from 61eb0fd to 478e3e3 Compare October 8, 2023 09:12
@hero78119 hero78119 marked this pull request as ready for review October 9, 2023 07:02
@hero78119 hero78119 force-pushed the nova_lookup branch 3 times, most recently from 2122ebe to aad19bc Compare October 12, 2023 14:15
Copy link
Contributor

@huitseeker huitseeker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much, this looks very, very nice!
I left a few comments on the Rust, which should help further review.

src/spartan/lookupsnark.rs Outdated Show resolved Hide resolved
src/gadgets/lookup.rs Outdated Show resolved Hide resolved
src/gadgets/lookup.rs Outdated Show resolved Hide resolved
src/gadgets/lookup.rs Outdated Show resolved Hide resolved
src/gadgets/lookup.rs Outdated Show resolved Hide resolved
src/gadgets/lookup.rs Outdated Show resolved Hide resolved
src/lib.rs Outdated Show resolved Hide resolved
src/lib.rs Outdated Show resolved Hide resolved
src/lib.rs Outdated Show resolved Hide resolved
src/lib.rs Outdated Show resolved Hide resolved
src/gadgets/lookup.rs Outdated Show resolved Hide resolved
@hero78119
Copy link
Contributor Author

hero78119 commented Oct 18, 2023

Hi @huitseeker Thanks for the review :)
I have addressed all the comments in latest commits

Besides, it seems new CI failed are due to like timeout, not normal test failed https://github.com/lurk-lab/arecibo/actions/runs/6558929723/job/17813491899 I can't reproduce locally. On workflow panel I can't find the way to re-trigger CI as well

@huitseeker
Copy link
Contributor

If you rebase on top of #76 you should get better results with CI!

@adr1anh
Copy link
Contributor

adr1anh commented Nov 1, 2023

First off, great work on getting this type of extension working on top of Nova! I've looked through your hackmd and the code, here are some thoughts. I've got a few code-related comments coming later too.

Soundness

Both the lookup gadget and the logic behind the lookup SNARK look good to me. I found the latter a bit confusing though, do you have any spec you're basing this off of?

What does the process look like for finalizing the proof? The LookupSNARK seems to only prove the correctness of the permutation argument, but I don't see any mention of verifying the R1CS instance. Would a user also create a CompressedSNARK alongside?

It seems like a good idea to have some separation between both proofs to ensure users can opt out of lookups, but there are also some missed opportunities for batching and minimizing the number of verifier hashes.

Cost estimation

It would be great to get an estimate of the number of constraints/variables required per read and write operation in each folding step.
This seems to be dominated by the cost of absorbing 3 field elements.
Would it be possible to get such an estimate?

Depending on the context (read vs. r/w, table size, number of accesses), it may be cheaper to consider the "vanilla" approach of performing these accesses solely with R1CS constraints.

Compressing SNARK

The Quarks approach for proving a grand product of $2^m$ terms requires a commitment to a vector of length $2^{m+1}$. For this lookup protocol, there are two such products to compute for both sides of the permutation check, leading to a total of $2^{m+2}$, or $4\cdot |T|$ committed elements, where $T$ is the size of the RAM.

The polynomial also needs to be opened at 2 different points, which can double the cost of the PCS opening. In the case of ZeroMorph, the reduction to univariate polynomials needs to be performed twice, leading to $\approx m$ additional group elements in the proof, and doubling the verifier costs.

In contrast, when the proof system has a mechanism for accessing "shifted" values or "rotations" of a vector, the grand product argument from PlonK can be used, which only requires the commitment to a single column of length $|T|$.

With the introduction of ZeroMorph, it would be possible to leverage this new PCS to implement the more efficient variant of product-check.

Witness generation

From a user-perspective, using lookups in this way means that the entire witness trace needs to be precomputed ahead of time, due to challenges alpha and gamma. Fundamentally, the amount of storage required to prove the program becomes linear in the number of folding steps, whereas it is constant for general folding.

This is fine for some applications, but can be impractical for others. It would be good to note what type of computations this SNARK is well-suited for. I think your example is a good one: An algorithm defined in terms of a loop, where the body has access to an array. In contrast to a non-folding-based circuit, the complexity of the compressing SNARK only depends on the size of the body.

Different techniques

Efficient lookup arguments will most likely all require modifications to some parts of the proof system. There are also some proposals (such as Origami and the LogUp variant in Protostar) that integrate lookups directly as part of the folding.

The main issue is that they require the recursive folding verifier to perform more scalar multiplications. With an implementation of CycleFold, this may have negligible effect on the recursive circuit, and may be competitive compared to the number of Poseidon hashes performed with the approach in this PR.

What your work shows though, is that the RAM/ROM abstraction is far more useful than fixed lookups. A recent paper surveys the different approaches to memory-consistency, and highlights how different proof systems implement this abstraction. It would be very neat to see these adapted to Nova, to get some intuition about the different performance profiles of the techniques.

In any case, being able to benchmark against an existing implementation will be very useful to figure out what approach works best!

@hero78119
Copy link
Contributor Author

Hi @adr1anh thanks for the insightful feedback! I will try to address thoughts in comments gradually :)

Soundness

The specs/flows in LookupSNARK::prove/verify is basically referred from RelaxedR1CSSNARK::prove() and RelaxedR1CSSNARK::verify() whereas in a simpler version. In LookupSNARK we only need offline-memory-check product-sumcheck + univariate PCS.

And yes, the R1CS verification should create CompressSNARK alongside. I choose to separation is due to I don't want to affect the exist CompressSNARK logic and I think it's easy for preliminary review, and ofc downside sacrificing efficiency on batching. I think the further optimisation on batching for LookupSNARK together with CompressSNARK definitely works because they share almost same flow of prove and verify. We can have a new CompressLookupSNARK, which tightly couple and allowing proving R1CS + Lookup together. The optimisation is open question an It will be nice to have more opinions from application side.

Computations of this SNARK well-suited

Actually, this SNARK is especially useful for applications which involves ALL scenarios in step circuit

  • random-memory-address
  • bitwise-operation
  • range check
    One part not fully demonstrate in this PR (and docs) is we can combine all the tables into a single big table, with each table occupied range of address space in setup phase! We can define different constant offset for them so utilize more on large field, and address space does not overlap. Since there is just single table to cover all usage, in Nova state the state variable remain constant size. and challenge are shared so the cost are amortized significantly. In the final LookupSNARK, values in final table are composed of dynamic table commitment from prorver + fixed table commitment pre-defined in verifier side in setup phase.

Because current approach are based on indexed-lookup, means to prove v exist in table T we need to provide index (or address) a such that T[a] = v. Then a question might raise: if we combine all into single table, how to assure the non-deterministic address (e.g. range check) a from prover are within a valid address range ? Intuitively we shouldn't have range check on a via lookup otherwise it's a chicken-egg problem.

A strategy is, we can placement different tables into single table in a well-defined order, such that the table which address are well-constrainted (such as RAM where the address to rw is derived from computation) are in low memory address, while the tables (address are supposed to be non-deterministic from prover, such as range check) are in higher address. Then, we can replace the non-deterministic address from prover by T[offset + v] == v where the v is the value to range check. We put fixed lookup table at higher address consecutively so the soundness: T[offset + v'] == v' where v' is outside of range is not possible. (This property can be proved easily and intuitively)

@hero78119
Copy link
Contributor Author

hero78119 commented Nov 27, 2023

Cost estimation

This seems to be dominated by the cost of absorbing 3 field elements.
In more detail, the cost can be spilited into 2 parts

  • absorbing 3 field elements -> poseidon hash circuit got padding, therefore the marginal cost for adding first read/write vs non first is different. The marginal cost when adding non-first read/write remain constant for a while after between the padding size.
  • extra R1CS var/constraints per read/write cost as below
name extra var extra constrains
Read on RWTable +15 new var +16 constraints
Write on RWTable +15 new var +16 constraints
Read on ReadOnlyTable + 8 new var + 10 R1CS constraints
Write on ReadOnlyTable + 8 new var + 10 R1CS constraints

table for RW cost more due to there are extra circuit to constraints global counter to avoid out-of-order r/w

read/write operation cost remain the same, because they used the similar way to operating R/W set

Giving we concat all table into a single table with separating address space.
On one scenarios: fixed lookup, e.g. 8 bits range check comparison

8 bit range check extra var extra constrains
vanilla approach +8 new var (bit) +9 constraints (8 constrains as bit + 1 bit decompositions)
lookup approach +15 new var +16 constraints

Which indeed "vanilla approach" outperform.

While in another fixed lookup, e.g. 64 bits XOR

64 bit XOR extra var extra constrains
vanilla approach 64 * 3 = 192 new var (bit) 64 * 2 (constrains as bit) + 64 (bit XOR) + 3 (bit decomposition) = 195
lookup approach (16 bit limb XOR) 4*3(16 limb decomposition) + 4 (limb concat as address) + 1 (result) + 4 * (15 for lookup) = 77 3 (16 limb decomposition) + 4 (limb concat as address) + 4 * (16 for lookup) = 71 constraints

Where this lookup approach shows a significant improvement over vanilla approach.

Above estimated show current lookup approach is fruitful on R1CS for application which got complex step circuit, e.g. zkVM.

there is an simple optimisze to erach less var/constraints via reuse constant/constraints for gamma/gamma^2 in same step circuit => now it's duplicated per read/write, will optimize and update this PR

Witness && memory overhead

For this LookupSNARK, the witness for lookup need to be know in advance outside of circuit, means there need to be an external witness generator. The external witness generator is very common in zkVM application to firstly execute the program first on fast x86 machine then talk to zk proving system to generate the zk proof. It's a good idea to document this in some place, or maybe let unit test demonstrate it ?

Fundamentally, the amount of storage required to prove the program becomes linear in the number of folding steps, whereas it is constant for general folding.

I am not sure about this part. I think the linear amount of storage is not due to this LookupSNARK design, whereas it's due to Nova non-deterministic step circuit design so we need to instantiate immutable step circuits in advanced. Each step circuit only hold the non-deterministic input for its round, so the size remain constant. @adr1anh would you mind to explain more for the linear in the number of folding steps?

New approach & techniques

In this PR, product sum augment are leveraged on old ppsnark->ProductSumcheckInstance, while now it was refactor renamed to ppsnark-> LookupSumcheckInstance => ppsnark->MemorySumcheckInstance recently and ported to Arecibo dev. Furthermore, I noticed there is another open PR #136 which use MemorySumcheck logUp protocol. I think logUp approach introduce extra subtle cost in folding step comparing current LookupSNARK approach, however it can , like one extra R1CS constraint to have the field inverse. ofc it will trade for better verifier cost.

Update: with logUp approach implemented in MemorySumcheck then we can reduce 2 challenge alpha/gamma into just one, so can reduce one more Nova state variable, with extra cost by just + 1 constraint + variable

The verifier cost of final SNARK will be amortized for application with massive folding steps. Comparing with existing SOTA lookup attempt on R1CS Nova, I think first attempt of offline-memory-check tricks in Nova folding which lack of rotation still be a good milestone to save quite of cost on circuit size for complex application. We can left benchmark with other approach when other implementation also ready, whats your opinion :) ?

@hero78119 hero78119 force-pushed the nova_lookup branch 2 times, most recently from 9b0ea2c to 1f33dbb Compare January 9, 2024 13:27
@hero78119
Copy link
Contributor Author

hero78119 commented Jan 9, 2024

updated PR with below change summary

  • Refactor MemorySumcheckInstance to support different request of logUp proving -> I named with new name MemorySumcheckInstanceV2 so I can test it in ppsnark without touch batched_ppsnark quite a lot in POC phase
  • Lookup switch to MemorySumcheckInstance and design CompressedSNARKV2 to support lookup sumcheck batching with all other sumcheck stuff, so user dont need to have separate SNARK to improve efficiency
  • rebase to latest arecibo dev and fixed all compiling error
  • run unittest cargo test --package arecibo --lib -- tests::test_ivc_rwlookup --exact got InvalidSumcheckProof failed, I am debugging and guessing just some small bug exist. Except for this, all other test in ppsnark are passed. So the mem_row, mem_col refactor I have verified success
  • Remove challenge poseidon circuit from folding circuit. Based on previous discussion, I think we can defer all challenge from folding circuit to final snark. Actually we already accumulate this rlc value in folding circuit (this is read_row doing). And in final snark prove phase, we let prover to aux a plonkish circuit to flatten this rlc field value (field value = read_row + final_table_read_row) to a column, then using this column to as input to compute poseidon or ECMH value in CompressedSNARK, then assert it equal with first challenge giving in Nova state. I am thinking from plonkish circuit perspective, and encode plonkish verify circuit in R1CS, although I think R1CS should work as well.

So with above optimized refactor, my lookup approach in folding will be extreme low cost, since in folding everything are working on field value without touch chanllenge circuit. challenge is defered to final snark. I guess this should be ultimate solution 😄 ?

Because so far I do not have clear idea for how to implement my idea about challenge defer in final CompressedSNARK R1CS, so it will be good if can gain some input from you

  • after that, the final thing is multiple table support. @adr1anh gave very nice suggestion by having a const table value, e.g. + gamma^3 * TABLE_CONST so we can support multiple table easily.

@hero78119 hero78119 changed the title read-write lookup implementation via permutation fingerprint in (Super)Nova R1CS read-write lookup offline-memory-checking logUp-based implementation in (Super)Nova R1CS Jan 12, 2024
@hero78119 hero78119 marked this pull request as draft January 12, 2024 07:11
@wyattbenno777
Copy link
Contributor

wyattbenno777 commented Jan 15, 2024

Thanks for this great write-up and PoC @hero78119 ! This approach seems very promising for zkVM where the execution trace is done in advanced followed by ZKP work.

This all boils down to, how do we know that the prover did not cheat and change alpha and gamma, while keeping them deterministic. The portion 'Here share a Fiat-Shamir variant for non-interactive incremental computation' https://hackmd.io/C6XlfTRFQ_KoRy1Eo-4Pfw?view

Removing challenge Poseidon Circuit (costs, benefits)


I do not follow all of the details specifically in your last checkbox above, but having access to alpha and gamma with the execution trace upfront, allows for a lot of flexibility. For example a small separate SNARK could be created where the the Prover commits to the init values and checks that they match the final returned value; I think this is the approach you are planning on doing. circa: 'first challenge giving in Nova state'.

This seems like a very good first version.

How do we make this more modular?

  1. If bi, ai, ci are assumed as public and known in advanced the final check can be computed on the fly by either the prover or verifier. In this case we would simply be doing a check that the NOVA IVC returned the correct alpha == gamma value for the table iterations against publicly known values.

  2. Would removing the hashes in folding circuit make it less modular? i.e. preventing IVC (stop at anytime and verify). In your comment above it seems like IVC would still be possible with just the read_row.

But keeping the hash allows devs to add more values into it in a compressed form: hash(alpha and gamma, bi, ai, ci, more if we want..) at increased cost of absorbing field elements. They could also toggle out 'ci' or other values and make them public inputs where needed, this would reduce the cost of the hash; In practice there may not be much need to keep the multiplicities private, or in circuit. Or if most of them are 0, tricks can be used from other sparse polynomial papers.

For pseudo-IVC the prover could pre-commit to values at possible 'stop points' representing what the hash should be at those points (or public values), if the IVC circuit is stopped at a break point it could run the SNARK needed for proof. This would work something like Risc0 continuations.

While completely getting rid of the costs of the hash by removing the challenge Poseidon circuit, it might be helpful to think in terms of a execution trace, what values are public, and/or prover pre-commitments to possible stop points.

In the case where the full execution trace is not known, or is not needed, a user could decide only to do these steps where lookups are used. MLE-structured tables F.3.1 would be used instead.

It would really be cool if we get this and also have great modularity :)

EDIT Other observations:

'Address bi constrained by step circuit opcode. If its for memory load/store, its constrained by instruction itself.' from here https://hackmd.io/C6XlfTRFQ_KoRy1Eo-4Pfw?view - in the case of SuperNova bi would be zi; any need to absorb it again?

@hero78119 hero78119 force-pushed the nova_lookup branch 2 times, most recently from 3d2313d to 0580e95 Compare January 22, 2024 08:08
@hero78119
Copy link
Contributor Author

hero78119 commented Jan 22, 2024

Removing challenge Poseidon Circuit (costs, benefits)

How do we make this more modular?

  1. If bi, ai, ci are assumed as public and known in advanced the final check can be computed on the fly by either the prover or verifier. In this case we would simply be doing a check that the NOVA IVC returned the correct alpha == gamma value for the table iterations against publicly known values.
  2. Would removing the hashes in folding circuit make it less modular? i.e. preventing IVC (stop at anytime and verify). In your comment above it seems like IVC would still be possible with just the read_row.

But keeping the hash allows devs to add more values into it in a compressed form: hash(alpha and gamma, bi, ai, ci, more if we want..) at increased cost of absorbing field elements. They could also toggle out 'ci' or other values and make them public inputs where needed, this would reduce the cost of the hash; In practice there may not be much need to keep the multiplicities private, or in circuit. Or if most of them are 0, tricks can be used from other sparse polynomial papers.

For pseudo-IVC the prover could pre-commit to values at possible 'stop points' representing what the hash should be at those points (or public values), if the IVC circuit is stopped at a break point it could run the SNARK needed for proof. This would work something like Risc0 continuations.

While completely getting rid of the costs of the hash by removing the challenge Poseidon circuit, it might be helpful to think in terms of a execution trace, what values are public, and/or prover pre-commitments to possible stop points.

In the case where the full execution trace is not known, or is not needed, a user could decide only to do these steps where lookups are used. MLE-structured tables F.3.1 would be used instead.

Hi @wyattbenno777 thanks for the feedbacks. Let me try to expand more context on challenge part refactor plans. You proposed many good points which is not covered by this PR, as this PR still in preliminary version with goals

  • based on indexed lookup to support random memory access.
  • lookup cost in folding step circuit just incur field operation
  • make a lookup gadget for cross folding step lookup (including random memory access)

For this lookup implementation, the read_set/write_set accumulation which implemented in folding step circuit, are already in pretty low cost by just few fields operations. Here are a raw summary #48 (comment) as can see, the cost for an RW access just +15 R1CS var + 16 R1CS constraints. But when we introduce this lookup approach, there is an cost for challenges alpha, gamma computation in folding circuit. Although this cost is agnostic to number of read/write and only introduce once, however it's still costly, as around ~992 constraints for Poseidon RO.

So one optimisation suggested by @adr1anh is can we seek another low cost way to accumulate and verify challenge alpha, gamma integrity in folding circuit. After some quick thoughts, I proposed an approach that we can defer RO in folding circuit to final SNARK.

  • Prover still compute challenges alpha/gamma in witness generation phase and put alpha/gamma in z0 state
  • Based on offline memory check formula (in LogUp variant), alpha, gamma should be derived from sequence of (bi, ai, ci). which can be viewed as sequence of read_set trace
  • In folding step circuit, we already accumulated read_set trace as a field value, which in core is linear combinations of (bi, ai, ci) with alpha/gamma
  • in final compression snark, verifier can get read_set field value from final state zn. To verify integrity of challenge, prover can witness (auxiliary), generating a proof, to attest the challenge-integrity to prove below
    - public input: alpha, gamma, read_set, audit_value_commitment, audit_ts_commitment
    - assert read_set == logUp_sum([(bi, ai, ci), (bi, ai, ci),.... (bi, ai, ci)], alpha, gamma) // (bi, ai, ci) sequence are witnessed from prover
    - computed_alpha := poseidon([(bi, ai, ci), (bi, ai, ci),.... (bi, ai, ci), audit_value_commitment, audit_ts_commitment]) // use (bi, ai, ci) to compute alpha
    - computed_gamma := poseidon(computed_alpha)
    - assert alpha == computed_alpha
    - assert gamma == computed_gamma

With this optimization, we do not need to encode poseidon circuit in folding step circuit, which is somehow achieve the match the real meaning of low-cost.

Would removing the hashes in folding circuit make it less modular? i.e. preventing IVC (stop at anytime and verify). In your comment above it seems like IVC would still be possible with just the read_row.

Actually I am not quite sure about this question related to modularity if removing challenge from folding circuit. (The vision seems bigger), because challenges in lookup gadget are assumed to be blackbox and do not expect other application-level manipulation, so it will be easily to use and bug-pruning.
I think IVC still can stop at any point and just depends on what you want to verify if stop earlier. For example, we can still verify recursive_snark correctness, but we can't take intermediate field value from z_i state and verify it business logic are satisfied. For example in zkVM case, we need to verify the pc are terminated at correct step. In this case, pc value will be in memory commitment.

Hopefully my elaboration for deferring challenge-integrity is clear :)

@wyattbenno777
Copy link
Contributor

Thank you for that very clear write-up on the LogUp proposed changes @hero78119 . That all seems very clear to be a SOTA implementation :)

I am messing around with your code a bit in a fork specifically in the context of memory consistency checks.
In the original Blum work, which the offline memory check from JOLT comes from, and also the way it was explained to me by one of the JOLT authors, the procedure goes as follows:

  1. W is init to (addr, val, 0 (multiplicities)) for all entries.
  2. for each 'read' there is a following 'write'.
  3. In the last step there is a read with no following write (added to R). This ensures that W.len == R.len and is ready for a permutation check.
  4. The tuples are made checkable with a Reed-solomon fingerprint (alpha and gamma). (LogUp work here.)

I assume it was done this way in research and in Lasso for soundness.
Please excuse me if I missed it in the fork or gadget here — but are we still following this 1,2,3,4 paradigm?

@adr1anh
Copy link
Contributor

adr1anh commented Jan 23, 2024

Hi all, and thank you for all the insightful discussion and work! I'd like to explain some thoughts around the construction using logUp, and to try and figure out how to make it more modular. Apologies in advance for the length of this response 😅.

The main point I would like to make though, is that I think we may want to consider doing the audit as a circuit, rather than a specialized SNARK. Even though this may be less efficient, it will allow us to explore these primitives and figure out how they can be applied more generally to zkVM designs. It should also allow us to explore different approaches such as ECMH (described in the Spice paper ) which do not depend on challenges r, gamma to be generated in advance.

Background

In the framework of Blum referenced by @wyattbenno777, we are maintaining two multi-sets R and W that contain entries (v, a, t), and after all the read/write operations, we check that R = W.

The original paper instantiated the multi-sets R, W using a hash-chain of all entries. Given the full pre-image of both sets, the verifier can check that these two hashes to indeed contain the same entries, though this takes time linear in the number of reads/writes.

The scheme can be more efficient for the verifier by implementing the multi-set abstraction in a more "ZK way" using some newer techniques.

As I understand in Lasso, we are constructing a set S by defining it as a pair of field elements acc_S, h_S. Given n elements in S, we would can define them as:

$$ acc_{S}(r, \gamma) = \prod_{i = 1}^n (r + v_i + \gamma \cdot a_i + \gamma^2 \cdot t_i) $$

$$ h_S = H((v_1, a_1, t_1), \ldots, (v_n, a_n, t_n)) $$

Alternatively using logUp, we can write acc_S as

$$ acc_{S}(r, \gamma) = \sum_{i = 1}^n \frac{1}{r + v_i + \gamma \cdot a_i + \gamma^2 \cdot t_i} $$

Both of these formulations are equivalent, assuming that r, gamma are both derived from h_S.

When applied to memory checking, both parties need to derive acc_R, h_R, acc_W, h_W. The way this is done in circuit is that the prover will supply r, gamma as public input, and the verifier is able to iteratively compute acc_S, h_S by adding entries to the running sum/product acc_S and hashing the entry to h_S. At the end of the program, we can add any remaining elements to W and then check that r, gamma was derived from (h_R, h_W) and that acc_R = acc_S.

The observation that @hero78119 implemented using logUp, is that we can combine both R,W into a single set S, by using the fact that we can add and remove elements from a set. I have written down the following construction, so that we can understand together how it works, and then figure out how to prove it secure in our specific context of (N)IVC.

Description

Init:

  • Get r, gamma as public input, which are derived by the prover beforehand using the same access patterns as defined below.
  • Set acc = 0, h = 0, t = 1

Read a:

  • Get (v', t') non-deterministically as the current value and timestamp of the RAM
  • Check that t' < t
  • Check that (t' = 0) => (v' = 0) for ensuring that the initial value was set to 0
  • Remove (v', a, t') from S
    • Update h = H(h, (v', a, t'))
    • Update acc = acc - 1/(r + v' + a * gamma + t' * gamma^2)
  • Add (v', a, t) to S
    • Update h = H(h, (v', a, t))
    • Update acc = acc + 1/(r + v' + a * gamma + t * gamma^2)
  • Optimization: merge remove and add
    • Update h = H(h, (a, v', t, t'))
    • Update acc = acc - 1/(r + v' + a * gamma + t' * gamma^2) + 1/(r + v' + a * gamma + t * gamma^2)
  • Update current timestamp t = t + 1
  • Return v'

Write (a, v):

  • Get (v', t') non-deterministically as the current value and timestamp of the RAM
  • Check that t' < t
  • Check that (t' = 0) => (v' = 0) for ensuring that the initial value was set to 0
  • Remove (v', a, t') from S
    • Update h = H(h, (v', a, t'))
    • Update acc = acc - 1/(r + v' + a * gamma + t' * gamma^2)
  • Add (v, a, t) to S
    • Update h = H(h, (v, a, t))
    • Update acc = acc + 1/(r + v + a * gamma + t * gamma^2)
  • Optimization: merge remove and add
    • Update h = H(h, (a, v', v, t, t'))
    • Update acc = acc - 1/(r + v' + a * gamma + t' * gamma^2) + 1/(r + v + a * gamma + t * gamma^2)
  • Update current timestamp t = t + 1

Audit:

  • Let A = {(v_1, a_1, t_1), ..., (v_n, a_n, t_n)} be the current state of the RAM, sorted by addresses a_i.
    • Define a_0 = 0 as sentinel
  • For i = 1, ..., n
    • Check that a_{i-1} < a_i to ensure that all addresses are unique.
    • Check that (t_i = 0) => (v_i = 0)
    • Insert the initial entry (0, a_i, 0) into S
      • Update h = H(h, (0, a_i, 0))
      • Update acc = acc + 1/(r + 0 + a_i * gamma + 0 * gamma^2)
    • Remove the final entry (v_i, a_i, t_i) from S
      • Update h = H(h, (v_i, a_i, t_i))
      • Update acc = acc - 1/(r + v_i + a_i * gamma + t_i * gamma^2)
    • Optimization: merge remove and add
      • Update h = H(h, (a_i, v_i, t_i))
      • Update acc = acc - 1/(r + v_i + a_i * gamma + t_i * gamma^2) + 1/(r + a_i * gamma)
  • Check that r, gamma are derived from h
  • Check that acc = 0

Application to NIVC

The above construction can easily be implemented purely in an NIVC setting. In the first phase, we modify each circuit's IO, so that it takes as in

  • Program IO z_in -> z_out
  • Timestamp t_in -> t_out
  • Hash chain h_in -> h_out
  • logUp accumulator acc_in -> acc_out
  • Constant challenges r, gamma that are passed as-is from input to output.
  • Two auxiliary variables (is_audit, a_prev) which are initialized to (0, -1) which are also passed as-is by the circuits which are not the audit one. It is only used by the audit circuit to keep track of the uniqueness check, and needs to be passed to the other circuits to ensure all SuperNova circuits have the same IO.

The initial input of the program in the first iteration is (z_0, t = 0, h = 0, acc = 0, r, gamma, is_audit = 0, a_prev = -1)

Each step circuit updates t, h, acc according to the reads and writes it performs. At the very end, we run an audit circuit as another NIVC circuit that will be initialized with the last output of the actual computation. The goal is to run audit iteratively until the we get to acc = 0

audit(z, t, h, acc, r, gamma, is_audit, a_prev):
	if (acc = 0)
		// we can safely reset t, a_prev to 0
		// since they are no longer required
		// and we can hide the number of accesses
		// and the largest address accessed
		return (z, t=0, h, acc=0, r, gamma, is_audit a_prev=-1)

	let (v', a', t') <- ... // non-deterministic alloc

	// in the first iteration, skip the comparison to allow for the zero address
	if is_audit = 0
		is_audit = 1
	else 
		// check uniqueness
		assert(a_prev < a')

	// base case checking 
	assert((t' = 0) => (v' = 0))

	// update hash
	h' = H(h, (a', v', t'))

	// add init
	acc' = acc + 1/(r + gamma * a')

	// remove current entry 
	acc' = acc' - 1/(r + v' + gamma * a' + gamma^2 * t')

	return (z, t, h', acc', r, gamma, is_audit, a')

The final SuperNova verifier will need to check

  • r, gamma is derived from h
  • acc = 0 and that the audit circuit was successful
  • the output program_counter corresponds to the audit circuit index
  • t, is_audit, a' can safely be ignored
    The audit circuit can be run iteratively until we reach acc = 0. Subsequent calls will ensure that the total number of accesses defined by t is hidden from the verifier. For folding efficiency, we can write the audit circuit as one which repeats itself many times.

We can probably find some better set of conditions to ensure uniqueness in the NIVC setting (I'm not yet 100% sure it is sound), but for now the definition of the audit circuit should satisfy our requirements.

By implementing the audit circuit entirely in-circuit, we can ensure that the construction is entirely compatible with an existing SuperNova verifier, and would be a lot cheaper since we do not need any additional Sumcheck or PCS openings to verify the state.

It is slightly more expensive in-circuit, but it does bring us closer to a more modular framework as @wyattbenno777 was talking about.

Using a SNARK for the audit

We need to impose several extra restrictions if we want to use a SNARK-like approach for performing the audit.

Firstly, we will need to define a fixed sized RAM where all addresses are integers in a range [0,N[, as a way of ensuring uniqueness of the addresses.

The read and write operations can be defined as before, and we can use the known bound 0 ≤ t < N to enforce a cheaper comparison check (as I think is already implemented in this PR).

During the audit SNARK, we need to supply 2 committed vectors V, T of size N such that (V_i, i, T_i) is the entry at the i-th address. Address i was not used, then we set V_i = T_i = 0.

The Sumcheck relation is then given by

$$ acc_S = \sum_{i=0}^{N-1} \frac{1}{r + V_i + \gamma \cdot i + \gamma^2 \cdot T_i} - \frac{1}{r + \gamma \cdot i } $$

Here, exploit the fact that the verifier can easily evaluate the polynomial whose evaluations are the first N integers (if N is a power of 2).

We then need to check that r, gamma were properly derived from h, V_i, T_i.

Removing hashes

Unfortunately, we cannot avoid the hashes that are performed during the read and write operation by the circuit. The main reason has to do with the semantics of the underlying interactive protocol that implements the multi-set that we are emulating in-circuit. Essentially, every tuple we add or remove from the set can be thought of as a value provided by the prover. It therefore appears in the corresponding Fiat-Shamir transcript. In our circuit implementation, we can think of the interaction as follows:

  • The prover sends a list of the tuples (v, a, t) which are added and removed from the multi-set, with any particular order.
  • The verifier needs to add all these tuples to the transcript to derive the next challenges r, gamma.
  • The verifier can now compute the logUp accumulator acc by adding or subtracting all tuples sent by the prover.

In the circuit, we are exploiting the non-determinism and can supply r, gamma in advance, which allows the verifier implemented in the circuit to compute acc and h iteratively. However, it is crucial that the tuple of every single insertion and removal is added to the transcript, even when it is seemingly added twice.

Soundness

It seems to me that we can easily adapt the existing security proof from the existing literature to the above construction. Though for the sake of discussion, we can try to understand intuitively why it should hold.

Before the audit happens, we can enforce the following invariant. By performing the reads and writes inside the circuit, we are ensuring that every update to S is valid. We are ensuring the property that if an address a was accessed at some timestamp t' and has value t', then the acc will contain the term

$$ \frac{1}{r + v' + a\cdot \gamma + t' \cdot \gamma^2 } - \frac{1}{r+ \gamma \cdot a} $$

At time t > t' a read or a write of v at address a transforms the term respectively into.

$$ \frac{1}{r + v' + a\cdot \gamma + t \cdot \gamma^2 } - \frac{1}{r+ \gamma \cdot a} $$

and

$$ \frac{1}{r + v + a\cdot \gamma + t \cdot \gamma^2 } - \frac{1}{r+ \gamma \cdot a}. $$

Two terms for the same address a can never occur twice, since reads and writes can never update acc by adding an entry with t=0. This is only possible during the audit phase. Moreover, by removing the addresses in order and checking that they are unique will ensure that the negative term is removed re-inserted only once.

@hero78119
Copy link
Contributor Author

hero78119 commented Jan 24, 2024

@adr1anh your summary and terminologies to formailize overall idea, is SUPER SUPER x 100 helpful for rest of us to speak on same math language 💯 💯 Thanks again for such a thorough and detail post. Based on your formalized terms, I can expand more on what current PR doing so it will helpful for other audience

Background & Application in (N)IVC

Giving multi-set W = R

W/R both encoded in LogUp sum
image

in this PR, TS polynomial are set as constant 1, so it equivalently checking 2 set equality via any element in one set is lookup ONCE in another set.

we can further expand W == R into W_initial_acc + W_acc = R_acc + R_audit_acc
obviously

W_initial_acc - R_audit_acc = R_acc - W_acc

To elaborate each terms

  • W_initial_acc: this is initial state of RAM
  • R_audit_acc: this is final state of RAM
  • R_acc, W_acc: both field value manipulated in IVC circuit by removing/adding rules

R_acc - W_acc != 0 in overwhelming probability.

Currently PR implement LogUp sumcheck with R_acc - W_acc as initial claim on target formula W_initial_acc - R_audit_acc, while W_initial_acc, R_audit_acc can be both derived in audit phase via SNARK solution (no audit circuit yet)

So

cited from @adr1anh Check that acc = 0 in audit phase

Currently in audit phase, I am checking acc = R_acc - W_acc via logUp sumcheck. Not acc = 0

r, gamma derived from H_s

To derived challenges , theoretically in sound we should derived from h_R, h_W.
However in current PR i made an optimisation by just derived challenges from h_R only, according to 2 observed fact, informally sound analysis

  • in the end R = W need to be satisfied, so imply h_R = h_W.
  • entries adding to h_R, h_W are constraints by IVC circuit implicitly

Furthermore, h_R field elements can be splitted into two subset:

  • h_R := h_R_ivc + h_R_audit
  • h_R_ivc := H({(v_i, a_i, t_i)…}), while tuple lists(v_i, a_i, t_i) are non-deterministic input from prover when perform read/write in step circuit
  • h_R_audit := final RAM state This set can be represented by non-deterministic polynomial commitment from prover via comitment({vi...}) and comitment({t_i…})

commitment({ai...}) are pre-known address vector [0, 1, 2, ….] so naturally in can be encode in verifier key or SNARK verifier flow.

For current PR implementation, h_R_ivc computation is implemented as Poseidon circuit in IVC (which lead to higher one time cost in step circuit) h_R_audit are from final SNARK
In the audit phase, challenges r gamma are derived from h_R, and comparing with prover supplied r gamma from public input.

Removing hashes

My proposed plan was to derived h_R_ivc tuples by a smaller scope audit circuit. Observed h_R_ivc can be derived from R_acc because both can be derived from raw set {(v_i, a_i, t_i)…}. So in the audit circuit, we can

Proposed audit circuit scope is smaller, just covering challenges integrity verification.

Limitation of proposed Removing hashes from IVC

One of the drawback for the solution is, if we implemented audit circuit in SRS bounded fashion, then we add a limitation cap that this audit circuit not handle massive read/write operations, for example zkEVM in riscv which can lead to billion number of read/writes. If we implemented in separate folding circuit, then the cost will still higher. I am still refer other literature you shared above and see other better alternative. If there are low cost alternative by still having challenge in folding circuit, or we can have challenge-free solution it will be super nice!

from @adr1anh: Unfortunately, we cannot avoid the hashes that are performed during the read and write operation by the circuit. ….(omitted partially)…. However, it is crucial that the tuple of every single insertion and removal is added to the transcript, even when it is seemingly added twice….

I am not pretty sure about this description for why it’s infeasible. Because in my solution, every single insertion/removal is still adding to transcript, just it’s implemented in audit circuit. Audit circuit force the entries sum equal to R_acc, then it’s still being used to computed final challenge. Besides there is no chicken-egg soundness like computing challenge using challenge as input.

I believe what you point out is some critical soundness, and appreciate for further elaboration 🙏

@adr1anh
Copy link
Contributor

adr1anh commented Jan 24, 2024

Challenge derivation

in the end R = W need to be satisfied, so imply h_R = h_W.

Even if the two sets are equal, they are constructed in a different order and therefore the insertions into each set will lead to h_R != h_W. Unless you are talking here about the logUp accumulator acc_R and acc_W, in which case you're right that they are equal, but the accumulator has to be computed using r, gamma, and these need to be derived from every single element that is used to compute the acc. More precisely, we need r, gamma <- H(h_R, h_W), where the pre-image of h_R and h_W contains all the individual terms that make up acc_R, acc_W.

In the description of the protocol I wrote about, there is a possible optimization that reduces the number of hashes, where we only use a single hash-chain h_RW, where we only need to hash (a, v', v, t, t') for a write, and ( v', a, t, t') for a read to h_RW. This is because we are doing the following operations in each operation:

  • Read: Add (a, v', t') to R and add (a, v', t) to W
  • Write: Add (a, v', t') to R and add (a, v, t) to W

During any read or write, the prover has to supply (v', t') non-deterministically, essentially an untrusted input from the prover. Therefore it must be added to the transcript h_RW. It seems that we may be able to get away from adding (v, t), since these values are derived by the IVC circuit. However, this requires some very specific conditions on the IVC circuit. I think we can write it as

  • The IVC circuit input z_0 needs to be added initially to the h_RW
  • Given z_0 and all the previously inserted entries (v', t'), it should be possible to deterministically derive (a, v, t).
    The second condition seems hard to check thoroughly, and I'm not even sure what the exact requirement would be. Unless we can derive a proper definition and accompanying security proof, I think we should still add the "deterministic" values (a, t) to h_RW.

Hashing using PCS

Adding polynomial commitments to the transcript/hash accumulator h should provide the same guarantees as using a field hash function like Poseidon due to the binding property of Pedersen commitments.

At the moment though, we can only use polynomial commitments alongside a SNARK-like protocol where we can use Sumcheck. Long-term, I think we should be able to incorporate these directly into the arithmetization layer, so that the Poseidon hashes can be removed entirely from the circuit. The main idea is that we allow the R1CS circuit to have additional witness columns A, V, T which are committed to before we start synthesizing W, X. We could then write the circuit by considering Z' = [A, V, T, W, X]. For an IVC sequence of witness columns [Z'_0, ..., Z'_n], the X component would have to contain r, gamma, which would be derived the sequence of polynomial commitments [(comm(A_0), comm(V_0), comm(T_0)), ..., (comm(A_n), comm(V_n), comm(T_n))]. For those familiar with halo2, this would be akin to allowing committed instance columns that are used for lookups.

Both techniques are morally equivalent, yet the PCS approach requires significantly more engineering to be usable. However, it seems like this will be the ultimate way of making the construction as efficient as possible in the long term. Moreover, I think it will lead to a cleaner overall design, where the entirety of the lookup logic is contained in-circuit, which gives an easier way to "opt-out".

One of the drawback for the solution is, if we implemented audit circuit in SRS bounded fashion, then we add a limitation cap that this audit circuit not handle massive read/write operations

I don't think this is any more true than with a SNARK, and in fact it is more flexible. With the SNARK approach, we are forced to use a fixed-size memory and the run-time of the prover will always be the size of the array. With the circuit approach, we only need to remove entries that were actually accessed. This allows us to consider a much larger address space, allowing for a mapping with arbitrary keys. Given that the number of unique addresses is variable in each IVC execution, we can write the audit circuit as I did above to be recursive, which allows us to run it an arbitrary number of times, and each iteration can audit a fixed number of entries.

From a practical approach though, and given how much we still need to figure out about the construction in terms of efficiency and security, I think we should design the construction in such a way that it's efficiency can be better analyzed from a more theoretical perspective, which will help guide us towards an "optimal" solution that is also practical to implement.

I am not pretty sure about this description for why it’s infeasible. Because in my solution, every single insertion/removal is still adding to transcript, just it’s implemented in audit circuit. Audit circuit force the entries sum equal to R_acc, then it’s still being used to computed final challenge. Besides there is no chicken-egg soundness like computing challenge using challenge as input.

The problem is that each entry is loaded non-deterministically twice and could be different (once when it is written, and once when it is read). In the honest case, both of these values will be equal, but we must assume it is not the case and protect against this case. We need to prevent against the case where the value and address being written can be chosen non-deterministically as they could then be chosen in a way that depends on r, gamma and would allow us to make the acc_W equal anything, including acc_R.

It may technically be possible to argue that the entry added to the write set W does not need to be hashed, but we would have to argue more thoroughly that it is ok.

As I'm writing this though, I think we might be able to argue that adding t to h_RW is not necessary if we initialize h_RW = H(h_RW_prev, t_init), where t_init is the current time-stamp at the start of the IVC circuit.
The address a seems a lot harder to argue since it computed by the circuit in ways that may not be linked to the IVC input z.

@hero78119 hero78119 force-pushed the nova_lookup branch 2 times, most recently from 9f4be89 to c0e8730 Compare January 26, 2024 09:13
add rwlookup test: heaplify

add pre-computed challenge mechanism

fix type mutabble in bench

make clippy happy

optimise constraints

wip

add lookupsnark

reformat

constraints scope to immutable stepcircuit

lookupsnark test pass

fix clippy

sumcheck back to private module

rollback sumcheck under ppsnark module to private

enable defering challenge check

separate permutation fingerprint challenges to alpha and gamma

chores: mutability, typo and better rust syntax

chores: use let-else instead

cleanup map_aux_dirty

read_row/write_row should be pass to verify func as arguments

chores: code cosmetics

implement intoiterator for lookup

chores: fix let-else usage

chores: remove tad pedantic usage

optimize lookuprtace rw_trace processing

tmp having RelaxedR1CSSNARKV2 for support table lookup efficiently

implemenet compressedsnarkv2

implement LogUp approach and padding
@wyattbenno777
Copy link
Contributor

I am rebasing this here and will make a PR in the next few days. I also moved/renamed some things into their own files so we do not have a bunch of V2 in one long file.

@wyattbenno777
Copy link
Contributor

rebased and PRed here
#386

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants