RFC: Extend the behavior of load/store ops to allow addressing all elements of a word #1064

Open · Tracked by #1104
bitwalker opened this issue Sep 5, 2023 · 20 comments

@bitwalker (Contributor) commented Sep 5, 2023

Background

I'm currently finishing up portions of the code generator in the compiler for Miden IR, and something came up as I was working through how to translate memory operations in the IR (which is byte-addressable) to Miden Assembly (which as we all know is word-addressable).

In particular, I noticed that the mem_load and mem_store instructions, which load/store an individual field element, rather than a word, only work for the first element of the word at the given address. Up until now, I had been operating under the assumption that those instructions could load/store any element of the word, but had missed that the addresses are word-aligned, thus making it (at least at first glance) impossible to represent a load for, say, the 3rd element of a word. In short, this significantly reduces the usefulness and generality of these instructions.

Proposal

I would like to suggest a change to these instructions that not only addresses this gap in functionality, but is backwards-compatible to boot.

The following are true of the current semantics:

  • The address operand for these instructions is a field element
  • The maximum address allowed is 2^32-1, which is in the u32 range. The ops in question trap if the address is out of this range.
  • If we were to u32.split the address operand, the high bits would be all zero, and the low bits would hold the address

I'm proposing the following changes:

  • The number and type of the operands remains the same
  • The address operand is treated as a pair of u32 values, and is split using the same rules as u32.split.
  • The previously unused (but zeroed) high bits will now be used to represent an offset from the base of the word at the address represented in the low bits. The instructions will trap if this offset is outside the range 0-3.

Without this change, to perform the equivalent of mem_load on the 3rd element of a word, you'd need to issue a sequence of instructions like padw, push.ADDR, mem_loadw, drop, drop, swap, drop. With this change, you would use push.ADDR, add.0x200000000, mem_load (placing offset 2, the zero-based index of the 3rd element, in the high 32 bits). That's, by my count (if we assume that mem_load goes from 1 cycle to 2 cycles to account for the u32 split), an improvement of 6 cycles (10 vs. 4) for a basic operation that is going to occur very frequently in programs with any local or global memory allocations.

The syntax in MASM could be extended for the version which accepts an immediate, such that the current syntax (e.g. mem_load.0x0) is equivalent to specifying a constant offset of 0. This would be extended to accept a constant offset as well (e.g. mem_load.0x0.3).
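
To make the comparison concrete, here is a hedged MASM sketch of both sequences for reading the 3rd element (zero-based index 2) of the word at address 0x0. The stack effects are assumed from the drop/swap sequence quoted above, not from a spec:

```
# Current approach: load the whole word, then discard everything but element 2
# (~10 cycles per the estimate above).
padw            # make room for the word on the stack
push.0x0        # word address
mem_loadw       # stack (top first, assumed): [e0, e1, e2, e3, ...]
drop drop       # stack: [e2, e3, ...]
swap drop       # stack: [e2, ...]

# Proposed approach: encode the element offset in the high 32 bits (~4 cycles).
push.0x0           # word address (low 32 bits)
add.0x200000000    # offset 2 in the high 32 bits
mem_load           # stack: [e2, ...]
```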

Potential Issue(s)

The primary issue I can think of, if I'm trying to find ways to break this, is that distinguishing between an out-of-bounds address and a valid address + offset pair becomes impossible. The case where an out-of-bounds address appears to be an in-bounds address with an invalid offset isn't really important to distinguish, as both will trap as invalid. However, it is possible, albeit unlikely, for an out-of-bounds address to appear as a valid in-bounds address + offset pair, in which case a load or store that would have previously trapped will now read/write some arbitrary location in memory.

We could defend against this by bit-packing a 24-bit CRC code in the high bits after the offset, which must be checked when the offset is non-zero, but whether it is worth imposing such a cost on every load/store really depends on how important it is that we catch such cases. Personally, I think it would be best to specify that the behavior is undefined in such cases, just as on a more typical system there is little protecting you from writing past the end of an array - it might trap, or it might not.

Alternatives

The proposed changes are essential in my opinion, because without them, it is exceedingly awkward to treat memory as word-addressable. The primary alternative is to instead treat memory as felt-addressable. There are a couple of problems with this:

  • The amount of memory allocated by a program which discards 75% of each word will, in the worst case, be 4x that of an equivalent program that can make full use of each word.
  • It becomes much more expensive to load values larger than a field element, requiring 2x the cycles just to perform the loads, not counting the additional ops required to compute each element's address.
  • You lose the symmetry of being able to load, operate on, and then store word-sized values using the native instructions. You only really remain unaffected with regard to the word-oriented instructions where the word is already on the operand stack, but getting words on and off the stack is much more awkward.

--

I should note that this is largely an issue which affects compilers, where fewer (or even no) assumptions can be made about where a given value is located in memory, so a consistent memory model is important. That said, even with hand-written assembly, you'd find yourself in a situation where you are either forced to waste most of a word, or required to emit expensive instruction sequences to read out a word and then extract what you need from it.

@bobbinth (Contributor) commented Sep 7, 2023

Thank you for such a detailed proposal! Overall, I think this would add very nice additional capability to the VM (i.e., having fully element-addressable memory could be useful in many circumstances). A few preliminary thoughts:

Implementation within the VM

I think implementing this within the VM will be quite tricky - but I also haven't spent a lot of time on this - so, maybe there are good solutions which I haven't considered yet.

The main issue is that we actually always have to read a full word from a memory address; it is just that for mem_load we put the other 3 elements into helper registers (see here). So, to select a specific element from the word, we'd need to do the following:

  1. Put all 4 elements into helper registers.
  2. Split the address into 2 parts - this would require putting the lower limb into another helper register.
  3. Represent the upper limb with 2 binary registers (e.g., [0, 0] means 0, [0, 1] means 1 etc.). Then we'd be able to use these registers as selectors for selecting a specific value.

Summarizing the above means that we need 7 helper registers, but currently we have only 6. I have some thoughts on how we can increase the number of helper registers to 8, but this would require a fairly significant refactoring of the decoder.

Another potential complication is that the selectors would be of degree 2, and thus, the constraint for populating the top stack slot would be of degree 3. This means that MLOAD and MSTORE operation flags cannot have degree greater than 6. This shouldn't be a problem as we currently have some empty slots for degree 5 operations. But this is still something to consider for the future.

Instruction semantics

If I understood everything correctly, the semantics you propose would be useful when addresses are static (or known at compile time). But for dynamic addresses, we'd still run into problems. For example, if we wanted to iterate over a sequence of elements in a loop, we wouldn't be able to easily increment memory address. So, it seems to me the usefulness is relatively limited.

One other idea is to treat memory as if it was element-addressable, and then impose alignment requirements on mem_loadw and mem_storew operations. So, for example:

  • mem_load.0 would load the first element of the first word; mem_load.1 would load the second element of the first word, mem_load.4 would load the first element of the second word etc.
  • mem_loadw.0 would load the first word, mem_loadw.4 would load the second word, but mem_loadw.1 would be illegal.

I think this would be much cleaner, but there are two main downsides to this approach:

  • We still need to solve the implementation problem I described above.
  • This would probably break most programs which use memory operations.

@bitwalker (Contributor, Author) commented:

If I understood everything correctly, the semantics you propose would be useful when addresses are static (or known at compile time). But for dynamic addresses, we'd still run into problems. For example, if we wanted to iterate over a sequence of elements in a loop, we wouldn't be able to easily increment memory address. So, it seems to me the usefulness is relatively limited.

I don't think that's necessarily true, since the element offsets for all 32-bit addresses are constant, i.e. adding 0 << 32, 1 << 32, 2 << 32, or 3 << 32 to any address gives you the "paired" address for the given element of that word.

If you wanted to traverse every element of n sequential words, completely dynamically, you'd do something like the following (using pseudocode):

let base_addr: *const Word = <the address in memory of the first word, doesn't have to be constant>;
for i in 0..n {
    let word_offset: u64 = i / 4;
    let element_offset: u64 = (i % 4) << 32;
    let word_addr: u64 = base_addr as u64 + word_offset;
    let element_addr: u64 = word_addr | element_offset;
    let element = *(element_addr as *const Element);
    // ...do stuff with element...
}

That's useful in quite a few scenarios, certainly for those in which I was intending to make use of the proposed extension.

One other idea is to treat memory as if it was element-addressable, and then impose alignment requirements on mem_loadw and mem_storew operation

I'm definitely a fan of this, as it is much cleaner right from the get-go, and is similar to how, on say x86-64, certain instructions impose additional alignment requirements on their operands (e.g. SSE2 instructions), though as you mentioned it breaks backwards-compatibility. That said, I'm not sure to what extent the memory ops are used currently, so perhaps it isn't too catastrophic. I would think memory gets used pretty frequently, for precomputed constant tables and things like that, but I have no idea.

We still need to solve the implementation problem I described above.

Is that true if addresses now refer to elements not words? I would think in that case, there is no longer a need for an offset, leaving things unchanged at the instruction level. Is it because the alignment requirements that would be imposed on mem_loadw and mem_storew require additional helper registers, much in the same way my proposal would?

@frisitano (Contributor) commented:

Having native element-addressable memory support at the VM level seems the superior option to me. There are a number of instances in the transaction kernel which would benefit from this functionality.

@bobbinth (Contributor) commented:

If you wanted to traverse every element of n sequential words, completely dynamically, you'd do something like the following (using pseudocode):

I didn't write out the actual assembly code for this, but just doing a rough estimation I think we'd incur an extra 6-7 cycles per element read in this code. This is better than the 9-10 cycles per read using padw, push.ADDR, mem_loadw, drop, drop, swap, drop - but it is not dramatically better. So, there is still a question in my mind whether the extra effort is worth a ~30% improvement.

I'm not sure to what extent the memory ops are used currently, so perhaps it isn't too catastrophic.

Quite a few modules currently use memory operations, but almost all of them are written (and maintained) by us, and most of them are pretty small. There are two exceptions:

  • The transaction kernel, which is fairly complex and relies heavily on memory. Memory accesses there are fairly well encapsulated, and @frisitano believes that it shouldn't be too difficult to move to the new memory model.
  • The recursive verifier, which is also fairly complex and also relies on memory quite a bit. @Al-Kindi-0 also believes that we could migrate it to the new memory model (though the effort may be more significant than with the transaction kernel).

Is that true if addresses now refer to elements not words?

No, addresses right now refer to words and we can read/write from/to memory only full words. The way we get around this for mem_load and mem_store instructions is by putting the things which don't end up on the stack into helper registers. So, we still read (or write) a full word, but part of the word goes into (or comes from) the helper registers.

As a side note, now that I look at the constraints of the mem_store operation, it seems to me that there might be a soundness bug there (i.e., a malicious prover could inject incorrect values into the temporary registers, and overwrite the other 3 elements at a given address with arbitrary values). @Al-Kindi-0 - what do you think? If so, we might have to move back to the approach we used previously, when each memory row in the memory chiplet contained both old and new values.

Coming back to the design for the new mem_load and mem_store operations. Their transition diagrams could look like so:

[transition diagram for the proposed MLOAD operation]

and

[transition diagram for the proposed MSTORE operation]

In the above:

  • $addr$ is the element address.
  • $waddr$ is the word address computed as $waddr = \lfloor addr / 4 \rfloor$ (integer division).
  • $s_0$ and $s_1$ are a binary encoding of the element index within a word. Thus, we can impose a constraint for valid address decomposition as follows: $waddr \cdot 4 + 2 \cdot s_1 + s_0 = addr$. In addition to this, we'll need constraints to enforce that $s_0$ and $s_1$ are binary, and that $addr$ is a u32 value (not sure about the best way to do this yet).
  • $val_0, ..., val_3$ are the values currently stored at $waddr$.

In the case of MLOAD, the constraint for $val$ would be:

$$ val = (1 - s_0) \cdot (1 - s_1) \cdot val_0 + s_0 \cdot (1 - s_1) \cdot val_1 + (1 - s_0) \cdot s_1 \cdot val_2 + s_0 \cdot s_1 \cdot val_3 $$
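
As a quick sanity check of this selection logic, here is a minimal Rust sketch (plain integers stand in for field elements; the function is illustrative, not the VM's API):

```rust
/// Select one element of a word using two binary selectors, mirroring the
/// MLOAD constraint above: the selected index is 2*s1 + s0.
fn select_element(vals: [u64; 4], s0: u64, s1: u64) -> u64 {
    debug_assert!(s0 <= 1 && s1 <= 1, "selectors must be binary");
    (1 - s0) * (1 - s1) * vals[0]
        + s0 * (1 - s1) * vals[1]
        + (1 - s0) * s1 * vals[2]
        + s0 * s1 * vals[3]
}
```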

The constraints for MSTORE seem to be somewhat more complicated - so, I'll need to think more about them. But in either case, we need to have at least 1 extra helper register.

@bitwalker (Contributor, Author) commented:

...I think we'd incur extra 6 - 7 cycles per element read in this code. This is better than 9 - 10 cycles per read...but it is not dramatically better. So, there is still a question in my mind whether the extra effort is worth ~30% improvement.

I agree that the savings are relatively minimal, though I do think it's a matter of frequency - if such instructions occur frequently, the cycles saved start to add up fast for a program with any significant runtime. That said, the design here is a compromise for the sake of backwards-compatibility, and is really more about ergonomics/uniformity (whether the assembly is hand-written or compiler-generated). Overall though, I much prefer the approach of making memory element-addressable in general, since that seems like an option we can actually consider. The difference in cycle count for equivalent memory accesses between the two models would be night and day I think, since it removes a lot of complexity around accessing individual elements. It does impose a bit more overhead when accessing words, but only in situations where the alignment of the address is not known, and must be checked at runtime. In practice, accessing values smaller than a word in memory is going to be far more common than >= word-sized values (at least for code compiled through Wasm), so favoring those operations is going to have a bigger performance impact.

No, addresses right now refer to words and we can read/write from/to memory only full words. The way we get around this for mem_load and mem_store instructions is by putting the things which don't end up on the stack into helper registers. So, we still read (or write) a full word, but part of the word goes into (or comes from) the helper registers.

That makes sense, though I guess what I was wondering, is if that changes at all with element-addressable memory vs word-addressable memory. In other words, would that change just make memory look element-addressable, but preserve how memory loads/stores work in the VM today (i.e. word-oriented, with some changes to the constraints as you described)? Or would it also involve reworking the implementation of memory in the VM so that it is element-oriented rather than word-oriented?

Obviously I'm looking at this through a very narrow lens, so the importance of doing things like memory accesses in word-sized units might have far reaching implications I'm not thinking about, but in an element-addressable world, it seems out of place to implement memory accesses in terms of words, particularly if mem_load/mem_store would require multiple helper registers that go essentially unused (beyond their role in the constraint system). It would obviously still be important to enforce word-alignment for mem_loadw/mem_storew, but I'm surprised it would be necessary to have the memory itself be word-oriented, if the word-alignment restriction ensures that there is always a full word at a given address.

Anyway, I want to avoid derailing the conversation here just because I don't have a solid grasp of how some of these pieces fit together in the VM internals, so if I'm missing obvious stuff here (or, obvious if you know the internals well), feel free to ignore the above. I'm certainly curious about the relationship between the memory instructions and their implementation in the VM (and how that affects the constraints), but I imagine it's probably been discussed/documented somewhere already, and I just need to go do some reading.

@bobbinth (Contributor) commented:

That makes sense, though I guess what I was wondering, is if that changes at all with element-addressable memory vs word-addressable memory. In other words, would that change just make memory look element-addressable, but preserve how memory loads/stores work in the VM today (i.e. word-oriented, with some changes to the constraints as you described)? Or would it also involve reworking the implementation of memory in the VM so that it is element-oriented rather than word-oriented?

The approach I described above would still keep the internal mechanics of memory word-oriented. So, basically, it would just look element-addressable.

Obviously I'm looking at this through a very narrow lens, so the importance of doing things like memory accesses in word-sized units might have far reaching implications I'm not thinking about, but in an element-addressable world, it seems out of place to implement memory accesses in terms of words, particularly if mem_load/mem_store would require multiple helper registers that go essentially unused (beyond their role in the constraint system). It would obviously still be important to enforce word-alignment for mem_loadw/mem_storew, but I'm surprised it would be necessary to have the memory itself be word-oriented, if the word-alignment restriction ensures that there is always a full word at a given address.

The word-oriented design is done primarily for efficiency reasons. For the use cases we are targeting, it will be very common to read/write memory as full words rather than individual elements. For example, nodes in a Merkle tree are words, output of the RPO hash function is a word (and inputs are 3 words), assets in Miden rollup are words etc. etc. So, we frequently have a need to read or write a full word (or even two consecutive words), and I wanted to make this as efficient as possible.

Consider mem_stream for example. This instruction reads two consecutive words and overwrites the top 8 elements of the stack with the result in a single cycle. Implementing this instruction with element-addressable memory would not be possible in our current design (or we'd need to add extra columns, which would slow down the rest of the VM). And emulating this instruction with individual loads would probably require ~30 cycles.

@bobbinth bobbinth added this to the v0.8 milestone Oct 12, 2023
@bobbinth bobbinth modified the milestones: v0.8, v0.9 Jan 11, 2024
@bobbinth bobbinth modified the milestones: v0.9, v0.11.0 Jul 24, 2024
@bobbinth bobbinth modified the milestones: v0.11.0, v0.12 Aug 7, 2024
@plafer (Contributor) commented Nov 27, 2024

I think it's possible to switch to element-addressable memory and retain the efficiency gains of the current word-addressable memory. The memory chiplet would remain similar in that it would store words; when a single address is read/written, the whole word is read in/out (similar to how our mload/store instructions work). The only difference is that each element in the word gets its own address. Words are also always aligned (i.e. the first address in a word always satisfies addr % 4 == 0).

Column layout sketch

Basically, I see the columns looking like (with a few left out to make the argument simpler):

r/w | e/w | addr | batch | idx0 | idx1 | clk | OLD_VALUES | NEW_VALUES

where the columns are as follows:

  • r/w: this is the current s0 flag; 0 for writes, 1 for reads
  • e/w: a flag whether a full word or just one element was accessed
    • say = 0 when element, = 1 when word
  • addr: the address of the word/element accessed
    • if e/w == 1, then addr is aligned
  • batch/idx0/idx1: these are the decomposition of addr, where addr = 4*batch + idx
    • and idx = 2*idx1 + idx0
    • Since idx < 4 due to the bit decomposition, every addr has a unique batch/idx decomposition
    • Hence, we can think of batch as the index of the word
  • OLD_VALUES and NEW_VALUES: this follows from Bobbin's proposal in a previous comment to address the potential bug in how the current mstore works
    • except that in this design, the address of OLD_VALUES[i] (and NEW_VALUES[i]) is 4*batch + i
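
To make the decomposition above concrete, here is a trivial sketch over plain integers (the function name is illustrative):

```rust
/// Decompose an element address per the layout above:
/// addr = 4 * batch + 2 * idx1 + idx0.
fn decompose(addr: u32) -> (u32, u32, u32) {
    let batch = addr / 4;
    let idx = addr % 4;
    (batch, idx >> 1, idx & 1) // (batch, idx1, idx0)
}

// e.g. decompose(45) == (11, 0, 1): batch 11, idx1 = 0, idx0 = 1
```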

Constraints sketch

To ensure that a READ is correct, we check

rw * (OLD_VALUES[i] - NEW_VALUES[i]) = 0 ; i = 0,1,2,3

To ensure the correctness of writes, we define flags f_i such that f_i == 1 when addr % 4 == i:

  • f_0 = (1-idx1) * (1-idx0)
  • f_1 = (1-idx1) * idx0
  • f_2 = idx1 * (1-idx0)
  • f_3 = idx1 * idx0

Then, when a single element is written (mstore), we constrain the other 3 values to be unchanged (note the (1 - ew) factor, which keeps these constraints from firing on word writes):

(1 - rw) * (1 - ew) * f0 * (OLD_VALUES[i] - NEW_VALUES[i]) = 0; i = 1,2,3
(1 - rw) * (1 - ew) * f1 * (OLD_VALUES[i] - NEW_VALUES[i]) = 0; i = 0,2,3
(1 - rw) * (1 - ew) * f2 * (OLD_VALUES[i] - NEW_VALUES[i]) = 0; i = 0,1,3
(1 - rw) * (1 - ew) * f3 * (OLD_VALUES[i] - NEW_VALUES[i]) = 0; i = 0,1,2

We also ensure that when a word is read/written (mstorew/mloadw), addr is aligned. This is equivalent to checking that idx = 0 (i.e. the remainder when dividing by 4 is 0), which for binary idx0/idx1 reduces to:

ew * idx0 = 0
ew * idx1 = 0
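
Putting the sketch together, here is a hedged Rust rendering of these row constraints (plain integer arithmetic stands in for field arithmetic; the row type is illustrative, not the VM's actual trace layout):

```rust
/// One memory-chiplet row, following the column sketch above.
struct Row {
    rw: i64, // 1 = read, 0 = write
    ew: i64, // 1 = word access, 0 = element access
    idx0: i64,
    idx1: i64,
    old_values: [i64; 4],
    new_values: [i64; 4],
}

/// True if the read/write/alignment constraints sketched above all vanish.
fn constraints_hold(r: &Row) -> bool {
    // f[i] == 1 exactly when addr % 4 == i
    let f = [
        (1 - r.idx1) * (1 - r.idx0),
        (1 - r.idx1) * r.idx0,
        r.idx1 * (1 - r.idx0),
        r.idx1 * r.idx0,
    ];
    // reads (element or word) leave all four values unchanged
    let reads_ok =
        (0..4usize).all(|i| r.rw * (r.old_values[i] - r.new_values[i]) == 0);
    // element writes leave the three non-selected values unchanged
    let writes_ok = (0..4usize).all(|j| {
        (0..4usize).filter(|&i| i != j).all(|i| {
            (1 - r.rw) * (1 - r.ew) * f[j] * (r.old_values[i] - r.new_values[i]) == 0
        })
    });
    // word accesses must be 4-aligned: ew * idx0 = 0 and ew * idx1 = 0
    let aligned_ok = r.ew * r.idx0 == 0 && r.ew * r.idx1 == 0;
    reads_ok && writes_ok && aligned_ok
}
```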

Conclusion

With element-addressable memory, we'd have 2^32 addressable elements instead of 2^32 words (i.e., 4x less total memory), but I don't think that's an issue. Unless I missed something, I think we should move to element-addressable memory, as it solves the compiler's problems and still retains the efficiency of the current design.

Note: we'd need a few changes to some instructions, where e.g. MSTREAM would increment the address by 8 instead of 2.

@bobbinth (Contributor) commented Dec 1, 2024

Thank you! I'm not sure I fully understand the design yet - so, a couple of questions:

  1. To implement memory, we usually need to sort the trace by address and then by clock cycle. What are we sorting by in this design? I'm assuming it is by batch, right?
  2. What actually goes onto the bus for read/writes? Specifically, how do full word read/writes look different from single element read/writes?
  3. What does memory access look like on the stack side (i.e., the other side of the memory bus)?

@plafer (Contributor) commented Dec 2, 2024

Right, I should have addressed those in the initial post.

  1. To implement memory, we usually need to sort the trace by address and then by clock cycle. What are we sorting by in this design? I'm assuming it is by batch, right?

That's right. To give better intuition, I'll give an example of a few loads/stores and how the chiplet columns would be built:

(clk)       
  1   mstore(addr=80, value=2)
  2   mstore(addr=42, value=100)
  3   mstore(addr=45, value=200)
  4   mload(addr=42)
  5   mstorew(addr=40, value=[1,2,3,4])  # This writes to addresses 40,41,42,43
  6   mloadw(addr=44)

 rw  |  ew  | addr | batch | idx0 | idx1 | clk | OLD_VALUES | NEW_VALUES
========================================================================
 w      e      42     10      0      1      2    [0,0,0,0]    [0,0,100,0]  # mstore(addr=42, value=100)
 r      e      42     10      0      1      4    [0,0,100,0]  [0,0,100,0]  # mload(addr=42)
 w      w      40     10      0      0      5    [0,0,100,0]  [1,2,3,4]    # mstorew(addr=40, value=[1,2,3,4])
 w      e      45     11      1      0      3    [0,0,0,0]    [0,200,0,0]  # mstore(addr=45, value=200)
 r      w      44     11      0      0      6    [0,200,0,0]  [0,200,0,0]  # mloadw(addr=44)
 w      e      80     20      0      0      1    [0,0,0,0]    [2,0,0,0]    # mstore(addr=80, value=2)
  2. What actually goes onto the bus for read/writes? Specifically, how do full word read/writes look different from single element read/writes?
  3. What does memory access look like on the stack side (i.e., the other side of the memory bus)?

The bus message for element-wise read/writes only needs to include the element being read/written; the bus message for word read/writes is unchanged from the current version. Note that the instructions' API does not change (except that the address for mstorew and mloadw now needs to be aligned).

Bus message for word read/writes

The message looks like

$$ x = \alpha_0 + \alpha_1 \cdot op_{mem} + \alpha_2 \cdot \mathrm{ctx} + \alpha_3 \cdot \mathrm{addr} + \alpha_4 \cdot \mathrm{clk} + \sum_{i=0}^{3} \alpha_{i+5} \cdot v_i $$

On the chiplet side, the $v_i$ correspond to NEW_VALUES[0..3]; the other variables come from the corresponding columns. Note that the bus cares about the address, not the batch.

On the stack side, the $v_i$'s are all on the stack.

Bus message for element read/writes

The message looks like

$$ y = \alpha_0 + \alpha_1 \cdot op_{mem} + \alpha_2 \cdot \mathrm{ctx} + \alpha_3 \cdot \mathrm{addr} + \alpha_4 \cdot \mathrm{clk} + \alpha_5 \cdot v $$

Here, $v$ is the element being written/read.

On the chiplet side, $v = \sum_{i=0}^3 f_i \cdot \mathrm{NEW\_VALUES}[i]$. That is, the $f_i$'s select the appropriate column in NEW_VALUES.

On the stack side, $v$ is simply on the stack. Note that in this design, we don't populate the helper registers with unconstrained values.

Bus (putting it all together)

Reusing the notation from the docs,

$$ v_{mem} = ew \cdot x + (1 - ew) \cdot y $$

$v_{mem}$ corresponds to the chiplet bus response; the chiplet bus request $u_{mem}$ (from the stack) is analogous (where $x$ and $y$ are constructed as described above for the stack side). As usual, the final bus equation is

$$ b_{chip}' \cdot u_{mem} = b_{chip} \cdot v_{mem} $$
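
For illustration, here is a hedged Rust sketch of the two message shapes (Goldilocks arithmetic done naively via u128; the function names and the flat `alphas` array are assumptions for the sketch, not the VM's actual API):

```rust
// Goldilocks prime, the field used by the VM.
const MODULUS: u128 = 0xFFFF_FFFF_0000_0001;

fn fadd(a: u64, b: u64) -> u64 { ((a as u128 + b as u128) % MODULUS) as u64 }
fn fmul(a: u64, b: u64) -> u64 { ((a as u128 * b as u128) % MODULUS) as u64 }

/// Word read/write message x: alphas 1..=4 weight (op, ctx, addr, clk),
/// alphas 5..=8 weight the four word elements.
fn word_message(a: &[u64; 9], op: u64, ctx: u64, addr: u64, clk: u64, v: [u64; 4]) -> u64 {
    let mut x = a[0];
    for (alpha, term) in a[1..5].iter().zip([op, ctx, addr, clk]) {
        x = fadd(x, fmul(*alpha, term));
    }
    for i in 0..4 {
        x = fadd(x, fmul(a[5 + i], v[i]));
    }
    x
}

/// Element read/write message y: same prefix, but only one value term.
fn element_message(a: &[u64; 9], op: u64, ctx: u64, addr: u64, clk: u64, v: u64) -> u64 {
    let mut y = a[0];
    for (alpha, term) in a[1..6].iter().zip([op, ctx, addr, clk, v]) {
        y = fadd(y, fmul(*alpha, term));
    }
    y
}
```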

@bobbinth (Contributor) commented Dec 3, 2024

Thank you for this explanation! It makes things much clearer. To summarize my understanding:

  1. In the current approach, there are 2 types of memory bus messages - either read or write a full word. Then, on the stack side we figure out how to process these messages based on whatever instruction is being executed.
  2. In the proposed approach, there are 4 types of memory bus messages: 2 for reading/writing words and 2 for reading/writing elements. This simplifies message handling on the stack side and we don't need to introduce extra helper columns there.

A few more questions:

  • Upon a quick look, it seems like we now need 15 columns for the memory chiplet. This would push the total number of chiplet columns from the current 17 to 18, right? If so, I wonder if there is a way to simplify something so that we don't have to increase the total number of columns. Though, maybe that's not too big of a concern for now.
  • How do the degrees of the constraints look now? I haven't done the math, but memory chiplet already had constraints of degree 9. Now we are introducing some additional selectors and I'm wondering if this will push the degree beyond that.

@plafer (Contributor) commented Dec 5, 2024

I figured out how to make it work with 15 columns total - which wouldn't require adding any new columns if we make the memory chiplet the second one right after the hasher chiplet (instead of the bitwise).

Memory chiplet columns

The columns would look like:

f_scb | rw | ew | ctx | batch | idx0 | idx1 | clk | VALUES | d0 | d1 | t

Notable changes are:

  • I included all the remaining columns that I omitted in previous posts (i.e. ctx, d0, d1, and t)
  • f_scb is 1 when the context and batch don't change and both the current and next rows are in the memory chiplet, and 0 otherwise
    • i.e. $f_{scb} = [1 - (n_0 + (1-n_0)n_1)] \cdot f_{mem} \cdot f_{mem}' | degree=8$,
      • where $n_0$ and $n_1$ are reused from our docs, with the slight modification that $n_1 = \Delta batch \cdot t'$.
    • scb is a mnemonic for "same context and batch"
    • (we need this to reduce the degree of some constraints)
  • We no longer commit to addr
    • It is now a degree 1 "virtual" column: addr = 4*batch + 2*idx1 + idx0
  • We got rid of the 4 OLD_VALUES columns

Memory chiplet constraints

EDIT: See #1064 (comment) for an updated version of these constraints

The ctx, batch, clk, d0, d1 and t columns work exactly the same as they currently do (where the former addr column is replaced with batch in this version). We already described how f_scb works. So it remains to show how we constrain the VALUES columns.

We will make use of the following variables:

$$
\begin{align}
f_{mem} &= s_0 \cdot (1 - s_1) \quad | \ degree = 2 \\
&\rightarrow \textrm{is 1 when the current row is in the memory chiplet, 0 otherwise} \\
f_{firstrow} &= (1 - s_0) \cdot f_{mem}' \quad | \ degree = 3 \\
&\rightarrow \textrm{is 1 when the next row is the first row of the memory chiplet, and 0 otherwise} \\
f_0 &= (1-idx1) \cdot (1 - idx0) \quad | \ degree = 2 \\
f_1 &= (1-idx1) \cdot idx0 \quad | \ degree = 2 \\
f_2 &= idx1 \cdot (1 - idx0) \quad | \ degree = 2 \\
f_3 &= idx1 \cdot idx0 \quad | \ degree = 2 \\
&\rightarrow \textrm{these are the same flags as described in the previous posts}
\end{align}
$$

We will also use v[i] = VALUES[i] for conciseness.

First row constraints

This subsection describes the constraints for the first row of the memory chiplet.

  1. If the first row is a read (element or word), then VALUES = [0,0,0,0]. For $0 \leq i < 4$,

$$ f_{firstrow} \cdot rw' \cdot v'[i] = 0 \quad | degree = 5 $$

  2. If an element is written, then all VALUES are 0 except for the element written. For $0 \leq j < 4$, and for all $i \neq j$,

$$ f_{firstrow} \cdot (1 - rw') \cdot (1 - ew') \cdot f_j' \cdot v'[i] = 0 \quad | degree = 8 $$

Note: We can combine some of the above 12 constraints (where some will be degree 9), but I omit this optimization here for notational convenience.

When a word is written, VALUES is unconstrained (well, only constrained by the bus).

Non-first row constraints

  1. When a read occurs in the same ctx and batch, VALUES don't change. For $0 \leq i < 4$,

$$ f_{scb} \cdot rw' \cdot ( v'[i] - v[i]) = 0 \quad | degree = 3 $$

  2. When an element write occurs in the same ctx and batch, VALUES remain unchanged except for the index that was written. For $0 \leq j < 4$, and for all $i \neq j$ (12 constraints),

$$ \begin{align} f_{scb} \cdot (1 - rw') \cdot (1 - ew') \cdot f_j' \cdot (v'[i] - v[i]) = 0 \quad | degree = 6 \end{align} $$

  3. When a read occurs in a different ctx or batch, VALUES = [0,0,0,0]. For $0 \leq i < 4$,

$$ f_{mem}' \cdot (1 - f_{scb}) \cdot rw' \cdot v'[i] = 0 \quad | degree = 8 $$

  4. When an element write occurs in a different ctx or batch, all VALUES are 0 except for the element written. For $0 \leq j < 4$, and for all $i \neq j$ (12 constraints),

$$ f_{mem}' \cdot (1 - f_{scb}) \cdot (1 - rw') \cdot (1 - ew') \cdot f_j' \cdot v'[i] = 0 \quad | degree = 8 $$

Bus

The bus works just as described in the previous post, with the slight caveat that addr is no longer a committed column (as previously explained). However, since addr is a degree 1 expression, the degree of the bus constraints doesn't change.

Bitwise chiplet

As mentioned by Bobbin offline, the bitwise chiplet is currently right after the hasher chiplet because they both require their rows to be aligned on boundaries of 8.

The solution is to add 1 column en (for "enabled") to the bitwise chiplet. This will bring the chiplet to 14 columns, which wouldn't require adding any columns to the VM. The "enabled" column works as follows:

  1. It can either be 0 or 1:

$$ f_{bitwise} \cdot (en^2 - en) = 0 | degree = 5 $$

  2. If it is set to 1, then the next row also needs to be set to 1:

$$ f_{bitwise} \cdot f_{bitwise}' \cdot en \cdot (1 - en') = 0 | degree = 8 $$

This one constraint encapsulates the behavior that the $en$ column can be 0 at the start, but as soon as it switches to $1$, it must remain at 1.

  3. When $en$ transitions from 0 to 1, the $1$ must be at a row aligned on a boundary of 8 (that is, $k_0 = 1$).

$$ f_{bitwise} \cdot f_{bitwise}' \cdot (1 - en) \cdot en' \cdot k_0 = 0 | degree = 9 $$

  4. All constraints are modified to also multiply by $en$ - that is, they are only active if the $en$ column is set.

When constructing the chiplet rows, the prover can then pad the right number of rows to reach the proper alignment of 8 at the start (setting the $en$ column to 0), and then generate the rest of the rows as usual (setting $en = 1$).
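
As a sanity check, the intended behavior of the $en$ column can be expressed directly. This is a hedged sketch; modeling $k_0$ as "row index is a multiple of 8" is an assumption about the periodic column:

```rust
/// Check the `en` column properties described above: values are binary,
/// once `en` flips to 1 it never returns to 0, and the first 1 sits on an
/// 8-row boundary.
fn en_column_ok(en: &[u64]) -> bool {
    let binary = en.iter().all(|&e| e == 0 || e == 1);
    let monotone = en.windows(2).all(|w| !(w[0] == 1 && w[1] == 0));
    let aligned = en.windows(2).enumerate().all(|(i, w)| {
        // if en flips 0 -> 1 between rows i and i+1, row i+1 must be 8-aligned
        !(w[0] == 0 && w[1] == 1) || (i + 1) % 8 == 0
    });
    binary && monotone && aligned
}
```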

Conclusion

Please double check everything, but assuming that this is right, I think we should definitely move to element-addressable memory, as we can do it without adding any column. There might also be a simpler set of constraints (e.g. that merge some of the "first row" and "not-first row" constraints), but at least we know of one set of constraints that works.

@Al-Kindi-0 (Collaborator) commented:

Thank you for this proposal.
A few questions to clarify my understanding:

  1. The s0 and s1 selectors are probably the (global) chiplet selectors, right?
  2. This is probably related to the first point, but I am not sure I follow the interpretation of the $f_{\text{firstrow}}$. If we want to select the first row of the memory chiplet, assuming it is right after the hasher, then I believe that expression should be changed.
  3. I believe the constraints for first row read should be separated for each element. Also, and I think this is related to the previous two points, I am not following the use of the next modifier on some of the terms in that constraint.
  4. In the non-first-row constraints, since $f_0$ is used, I believe the inequality should be strict $0 < i < 4$. For the other flags $f_i$, a variation on this idea should apply too, basically point 4 in your comment.
  5. Same comment for point 3 in non-first-row constraints about separating the constraints.

Overall, I think this looks promising and we should go for it once we are all on board.

@bobbinth (Contributor) commented Dec 6, 2024

Not a super detailed review of the proposal yet from my end. Overall, I think this approach should work - but a couple of questions/comments:

First, shouldn't $f_{mem}$ be defined as $s_0 \cdot s_1 \cdot (1 - s_2)$? That is, memory chiplet uses 3 external selector columns. If so, then the degree of some constraints would increase by 1 or 2.

Second, I wonder if we could simplify the constraint description by defining a flag per value which indicates whether the value should be copied over from the previous row. We could call this flag something like $c_i$ - one for each of the $v$ values. For example, $c_0$ could look something like this:

$$ c_0 = (f_0 + f_{ew} - f_0 \cdot f_{ew}) \cdot (1 - f_{rw}) | degree = 4 $$

Basically, we are saying that $c_0 = 1$ if (($f_0 == 1$ OR $f_{ew} == 1$) AND $f_{rw} == 0$).

Using $c_i$ flags, we should be able to express many of the read/write and word/element constraints more uniformly. Specifically:

$$ c_i \cdot (v'_i - v_i) = 0 | degree = 5 $$

Should be sufficient to define all non-bus constraints for the $v_i$ columns.

We can also extend this to handle "new batch" situations. Assuming we have $f_{scb}$ column which indicates that we are creating the first row for a given address, we may be able to write the constraint as:

$$ c_i \cdot (v_i' - (1 - f_{scb}) \cdot v_i ) = 0 | degree = 6 $$

@plafer (Contributor) commented Dec 6, 2024

Response to Al

  1. The s0 and s1 selectors are probably the (global) chiplet selectors, right?

Correct.

  2. This is probably related to the first point, but I am not sure I follow the interpretation of $f_{firstrow}$. If we want to select the first row of the memory chiplet, assuming it is right after the hasher, then I believe that expression should be changed.

Ah yes, I changed it to $f_{firstrow} = (1 - s_0) \cdot f_{mem}'$. So this evaluates to 1 if the current row is the hasher chiplet ($1 - s_0$), and the next row is the memory chiplet ($f_{mem}'$).

  3. I believe the constraints for the first row read should be separated for each element.

Fixed.

Also, and I think this is related to the previous two points, I am not following the use of the next modifier on some of the terms in that constraint.

And based on answer (2), the way to think about $f_{firstrow}$ is really "next row is first memory chiplet row", which explains the next modifiers. Or in plain terms, if we don't use them, then we're looking at the hasher chiplet, which is obviously wrong.

  4. In the non-first-row constraints, since $f_0$ is used, I believe the inequality should be strict $0 < i < 4$. For the other flags $f_i$, a variation on this idea should apply too, basically point 4 in your comment.
  5. Same comment for point 3 in non-first-row constraints about separating the constraints.

Fixed

Response to Bobbin

First, shouldn't $f_{mem}$ be defined as $s_0 \cdot s_1 \cdot (1 - s_2)$? That is, the memory chiplet uses 3 external selector columns. If so, then the degree of some constraints would increase by 1 or 2.

My proposal assumes that we put the memory chiplet right after the hasher chiplet (in place of the current bitwise chiplet).

Second, I wonder if we could simplify constraint description by defining a flag per value which indicates whether the value should be copied over from the previous row.

I'm not sure I follow how $c_0$ is defined. First, values will only ever need to be copied in the same context & batch, so I'd expect $f_{scb}$ in there. Also, when we write a word, $f_0 = 1$ (since index is always 0 for words), $f_{ew} = 1$, and $f_{rw} = 0$, so $c_0 = (1 + 1 - 1)(1 - 0) = 1$. But writing a word is a case where $c_0$ should be $0$ (i.e. we don't expect the value to be copied over from the previous row).

We can also extend this to handle "new batch" situations. Assuming we have $f_{scb}$ column which indicates that we are creating the first row for a given address, we may be able to write the constraint as:

$$ c_i \cdot (v_i' - (1 - f_{scb}) \cdot v_i ) = 0 | degree = 6 $$

Not sure I understand how this one is built either. If we are in a "new batch" situation, then $f_{scb} = 0$. So the above is simplified to

$$ c_i \cdot (v_i' - v_i ) = 0 $$

But to my previous comment, I'd always expect $c_i = 0$, since we are in a new batch and no element should be copied across a new batch. This doesn't enforce that all $v_i$ are 0 either.

@bobbinth (Contributor) commented Dec 8, 2024

I'm not sure I follow how $c_0$ is defined. First, values will only ever need to be copied in the same context & batch, so I'd expect $f_{scb}$ in there. Also, when we write a word, $f_0 = 1$ (since index is always 0 for words), $f_{ew} = 1$, and $f_{rw} = 0$, so $c_0 = (1 + 1 - 1)(1 - 0) = 1$. But writing a word is a case where $c_0$ should be $0$ (i.e. we don't expect the value to be copied over from the previous row).

Yes - I made quite a few mistakes in my logic here, and I'll describe a (hopefully) more correct approach which illustrates what I was thinking. But first, I think there may be an issue with the $f_{scb}$ flag.

Specifically, I'm not sure we can rely solely on this flag to identify memory chiplet constraints. That is, every memory chiplet constraint still has to be multiplied by $f_{mem}$. The reasoning is that since $f_{scb}$ is a committed value, it can so happen that this value will be set to $1$ in other chiplets as well (e.g., in the hasher chiplet or in the bitwise chiplet etc.). If this happens, some "irrelevant" constraints may be turned on in these places, which could cause completeness issues (maybe soundness issues are possible as well, but I haven't thought this through).

So, I'd propose to redefine this flag as $f_{scb} = [1 - (n_0 + (1-n_0) \cdot n_1)] \cdot f_{mem}'$. Basically, it would be $1$ if the context and batch don't change and the next row is in the memory chiplet. Or, conversely, this flag is set to $0$ for the first row of a trace for a given context+batch, and for the very last row of the memory chiplet.

Now, using this definition of $f_{scb}$, I'll try to define a $c_i$ value which is $1$ when the value of $v_i$ should be copied over from the previous row and $0$ otherwise. But let's build up to it step by step.

First, at the high level, the value of $v_i$ needs to be copied over to the next row if: (1) we are doing a read operation, OR (2) we are doing a write operation and $v_i$ is not selected for the update. So, let's first define a flag $z$ which will be $1$ when a value is not selected for the update. It would look something like this:

$$ z_i = (1 - f_{ew}) \cdot (1 - f_i) $$

Basically, this flag is $1$ only when both $f_{ew} = 0$ (we are not operating on the entire word) and $f_i = 0$ (we are not operating on this specific value).

Using this definition of $z_i$ and assuming $f_{rw} = 1$ for read operations, we can define $c_i$ as:

$$ c_i = f_{rw} + z_i - f_{rw} \cdot z_i | degree = 4 $$

But as you mentioned in your comment, we want this flag to be set to $1$ only if the context/batch combination doesn't change - and so, we can update the definition of $c_i$ to be:

$$ c_i = f_{scb} \cdot (f_{rw} + z_i - f_{rw} \cdot z_i) | degree = 5 $$

Then, the constraint that we want to express becomes: when we are in the memory chiplet and $c_i = 1$, the value of $v_i$ should be copied over:

$$ f_{mem} \cdot c_i \cdot (v'_i - v_i) = 0 | degree = 9 $$

The degree $9$ here assumes that we keep the order of chiplets the same, but if we re-arrange them as you've suggested, it would drop to 8.

The above constraint should cover all $v_i$ transitions except for initializing the first row of the trace for each batch. I haven't thought through these constraints - so, I'm not 100% sure this structure will work for them, but it may be worth exploring.
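
As a quick sanity check of the $z_i$/$c_i$ definitions above, a small Rust sketch evaluating the flag in the interesting cases (0/1 integers stand in for field values; names are illustrative):

```rust
/// c_i = f_scb * (f_rw + z_i - f_rw * z_i), with z_i = (1 - f_ew) * (1 - f_i),
/// per the definitions above.
fn c(f_scb: i64, f_rw: i64, f_ew: i64, f_i: i64) -> i64 {
    let z = (1 - f_ew) * (1 - f_i);
    f_scb * (f_rw + z - f_rw * z)
}

fn main() {
    assert_eq!(c(1, 0, 1, 1), 0); // word write: nothing is copied
    assert_eq!(c(1, 1, 0, 0), 1); // read: every value is copied
    assert_eq!(c(1, 0, 0, 0), 1); // element write, value not selected: copied
    assert_eq!(c(1, 0, 0, 1), 0); // element write, value selected: not copied
    assert_eq!(c(0, 1, 0, 0), 0); // new ctx/batch: nothing is copied
}
```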

@bobbinth (Contributor) commented Dec 8, 2024

The solution is to add 1 column en (for "enabled") to the bitwise chiplet. This will bring the chiplet to 14 columns, which wouldn't require adding any columns to the VM. The "enabled" column works as follows:

Do we also need a constraint to ensure that the $en$ value can be flipped from $0$ to $1$ only on rows which are multiples of $8$?

@plafer (Contributor) commented Dec 9, 2024

Ah yes, those constraints work!

Do we also need a constraint to ensure that the $en$ value can be flipped from $0$ to $1$ only on rows which are multiples of $8$?

Ah yes we need that, I will fix it in the previous comment.

The above constraint should cover all $v_i$ transitions except for initializing the first row of the trace for each batch. I haven't thought through these constraints - so, I'm not 100% sure this structure will work for them, but it may be worth exploring.

I think I found a way to extend your idea to the "first row for each batch" case. I will assume that we make the memory chiplet come right after the hasher chiplet again in order to not have to add a new column to the VM for $f_{scb}$. I will also be reusing your definition of $f_{scb}$.

I will slightly adjust your definition of $c_i$ to be:

$$ c_i = f_{rw}' + (1 - f_{rw}') (1 - f_{ew}') (1 - f_i') | degree=4 $$

which is $1$ when $v'[i]$ is not written to, and $0$ otherwise. Hence, the way to think about $c_i$ is "every time $v'[i]$ needs to be constrained". Depending on the context, that constraint will set $v'[i]$ either to $v[i]$ or to 0.

first row

For $0 \leq i < 4$,

$$ f_{firstrow} \cdot c_i \cdot v'[i] = 0 | degree=8 $$

That is, if $row'$ is the first row of the memory chiplet and $v'[i]$ is not written to, then $v'[i]$ must be 0.

non first row, new batch/context

For $0 \leq i < 4$,

$$ f_{mem} \cdot (1 - f_{scb}) \cdot c_i \cdot v'[i] = 0 | degree=8 $$

That is, if $row'$ starts a new batch/ctx in the memory chiplet and $v'[i]$ is not written to, then $v'[i]$ must be 0.

non first row, same batch/context

For $0 \leq i < 4$,

$$ f_{mem} \cdot f_{scb} \cdot c_i \cdot (v'[i] - v[i]) = 0 | degree=8 $$

That is, if $row'$ is in the same batch/ctx in the memory chiplet and $v'[i]$ is not written to, then $v'[i]$ must be equal to $v[i]$.

@bobbinth (Contributor) commented:

Nice! I didn't go through all the constraints in detail, but on a quick look this should work. One question: if we keep the order of chiplets the same as now, would the degrees of these constraints go up to $9$? If so, maybe it is worth keeping the order the same, as it will be much less work.

@Al-Kindi-0 (Collaborator) commented:

Double-checked the constraints and all looks good.
A few points:

  1. There is a mismatch between the different comments on the definition of the flag $f_{rw}$.
  2. Given the above, the definition of the $c_i$ flag is the same in all of the last three comments, modulo the factor $f_{scb}$.
  3. Given the above, I would define $c_i$ in the penultimate comment using the intermediate value $z_i$ as this is cleaner and more principled.

@plafer (Contributor) commented Dec 17, 2024

I just noticed a flaw in the design: the decomposition addr = 4*batch + idx, idx < 4 would force the prover to write, e.g., batch=10 and idx=1 for addr=41 only if we were working over the integers. In a field, we have a division operation (defined over all pairs of elements), so for addr=41, the prover could just write batch=41/4 and idx=0.

The fix for this is to do a 32-bit range check on batch and 4*batch + idx - hence, ensuring the "integer behavior" that we want.

Note: technically we want a 32-bit range check on 4*batch, but we check 4*batch + idx since it both checks that 4*batch < 2^32 (due to 0<=idx<4), and also constrains addr = 4*batch + idx < 2^32 (a check we don't have today).

Repercussions

This means we'll need an extra column in the chiplet (since batch will need to be split in its 16-bit parts), and we'll need an extra logup bus column for the new range checks (since the current b_chip is full). From offline discussions, we were going to need more columns for the hasher chiplet anyways, so the actual extra cost here would be the extra auxiliary trace column. But arguably our current memory constraints lack a range check on addr, so we'd need that auxiliary column anyways for correctness.

Appendix

This time I double-checked my intuition by proving (experimentally) that

  • x % 4 = 0 if and only if x / 4 < 2^32, and (equivalently)
  • x % 4 != 0 if and only if x / 4 >= 2^32

In other words, x/4 will only be "small" if x is "divisible by 4" in the integer sense. For example, 40 / 4 = 10 (mod MODULUS), which is indeed smaller than 2^32. But 41/4 = 13835058052060938251 (mod MODULUS) is greater than 2^32.

MODULUS is the modulus for the Goldilocks field

Below is the code to check these claims:

use vm_core::Felt;

fn main() {
    const FOUR: Felt = Felt::new(4);

    let max = u32::MAX / 4;
    let two_power_32 = 2_u64.pow(32);

    for batch_idx in 0..max {
        // x is a multiple of 4 in the integer sense...
        let x = Felt::from(batch_idx * 4);

        // ...so x/4 (a field division) is "small", i.e. below 2^32...
        assert!((x / FOUR).as_int() < two_power_32, "x: {x}");
        // ...while (x+1)/4, (x+2)/4, and (x+3)/4 all land above 2^32
        assert!(((x + 1_u32.into()) / FOUR).as_int() >= two_power_32, "x: {x}");
        assert!(((x + 2_u32.into()) / FOUR).as_int() >= two_power_32, "x: {x}");
        assert!(((x + 3_u32.into()) / FOUR).as_int() >= two_power_32, "x: {x}");
    }
}
