RFC: Extend the behavior of load/store ops to allow addressing all elements of a word #1064
Thank you for such a detailed proposal! Overall, I think this would add a very nice additional capability to the VM (i.e., having fully element-addressable memory could be useful in many circumstances). A few preliminary thoughts:

### Implementation within the VM

I think implementing this within the VM will be quite tricky - but I also haven't spent a lot of time on this - so, maybe there are good solutions which I haven't considered yet. The main issue is that we actually always have to read a full word from a memory address. It is just for cases of
Summarizing the above means that we need 7 helper registers, but currently we have only 6. I have some thoughts on how we can increase the number of helper registers to 8, but this would require a fairly significant refactoring of the decoder. Another potential complication is that selectors would be of degree 2, and thus, the constraint for populating the top stack slot would be of degree 3. This means that

### Instruction semantics

If I understood everything correctly, the semantics you propose would be useful when addresses are static (or known at compile time). But for dynamic addresses, we'd still run into problems. For example, if we wanted to iterate over a sequence of elements in a loop, we wouldn't be able to easily increment the memory address. So, it seems to me the usefulness is relatively limited. One other idea is to treat memory as if it were element-addressable, and then impose alignment requirements on
I think this would be much cleaner, but there are two main downsides to this approach:
I don't think that's necessarily true, since the element offsets for all 32-bit addresses are constant, i.e. adding

If you wanted to traverse every element of
That's useful in quite a few scenarios, certainly for those in which I was intending to make use of the proposed extension.
I'm definitely a fan of this, as it is much cleaner right from the get go, and is similar to how on, say x86-64, certain instructions impose additional alignment requirements on their operands (e.g. SSE2 instructions), though as you mentioned it breaks backwards-compatibility. That said, I'm not sure to what extent the memory ops are used currently, so perhaps it isn't too catastrophic. I would think memory gets used pretty frequently, for precomputed constant tables and things like that, but I have no idea.
Is that true if addresses now refer to elements, not words? I would think in that case, there is no longer a need for an offset, leaving things unchanged at the instruction level. Is it because of the alignment requirements that would be imposed on
Having native element-addressable memory support at the VM level seems the superior option to me. There are a number of instances in the transaction kernel which would benefit from this functionality.
I didn't write out the actual assembly code for this, but just doing a rough estimation I think we'd incur an extra 6-7 cycles per element read in this code. This is better than 9-10 cycles per read using
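The word-extraction sequence being compared here (`padw, push.ADDR, mem_loadw, drop, drop, swap, drop` from the original proposal) can be modeled on a toy stack machine. This is only a sketch, not the VM's actual semantics; in particular it assumes `mem_loadw` leaves the word's first element on top of the stack, and under that assumption the sequence nets the element at index 2 of the word.

```rust
/// Toy model of the stack effect of:
///   padw, push.ADDR, mem_loadw, drop, drop, swap, drop
/// Assumption: mem_loadw leaves word[0] on top of the stack.
fn extract_element(word: [u64; 4]) -> u64 {
    let mut stack: Vec<u64> = Vec::new(); // top of stack = end of the Vec
    stack.extend([0, 0, 0, 0]); // padw
    stack.push(100); // push.ADDR (the address value is irrelevant in this model)

    // mem_loadw: pop the address, overwrite the top 4 slots with the word
    stack.pop();
    let n = stack.len();
    for (i, e) in word.iter().enumerate() {
        stack[n - 1 - i] = *e; // word[0] ends up on top (assumption)
    }

    stack.pop(); // drop
    stack.pop(); // drop
    let n = stack.len();
    stack.swap(n - 1, n - 2); // swap
    stack.pop(); // drop
    stack.pop().unwrap() // the surviving element
}
```

Counting the seven operations above against the two-to-three instructions of the proposed form is where the rough 6-7 cycle estimate comes from.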
Quite a few modules currently use memory operations, but almost all of them are written (and maintained) by us, and most of them are pretty small. There are two exceptions:
No, addresses right now refer to words, and we can read/write from/to memory only full words. The way we get around this for

As a side note, now that I look at the constraints of the

Coming back to the design for new and

In the above:
In the case of

The constraints for
I agree that the savings are relatively minimal, though I do think it's a matter of frequency - if such instructions occur frequently, the cycles saved start to add up fast for a program with any significant runtime. That said, the design here is a compromise for the sake of backwards-compatibility, and is really more about ergonomics/uniformity (whether the assembly is hand-written or compiler-generated). Overall though, I much prefer the approach of making memory element-addressable in general, since that seems like an option we can actually consider. The difference in cycle count for equivalent memory accesses between the two models would be night and day I think, since it removes a lot of complexity around accessing individual elements. It does impose a bit more overhead when accessing words, but only in situations where the alignment of the address is not known, and must be checked at runtime. In practice, accessing values smaller than a word in memory is going to be far more common than >= word-sized values (at least for code compiled through Wasm), so favoring those operations is going to have a bigger performance impact.
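The address arithmetic implied by an element-addressable model can be sketched concretely. A minimal sketch, assuming 4 field elements per word (the helper names are hypothetical, not VM APIs):

```rust
/// Map an element address to (word address, element offset),
/// assuming 4 elements per word.
fn to_word_addr(elem_addr: u32) -> (u32, u32) {
    (elem_addr / 4, elem_addr % 4)
}

/// The runtime alignment check a word-sized access would need when the
/// address is not statically known to be word-aligned.
fn is_word_aligned(elem_addr: u32) -> bool {
    elem_addr % 4 == 0
}
```

This is the "bit more overhead when accessing words" mentioned above: a word access from a dynamic address pays for the alignment check, while element accesses need no decomposition at all.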
That makes sense, though I guess what I was wondering is if that changes at all with element-addressable memory vs word-addressable memory. In other words, would that change just make memory look element-addressable, but preserve how memory loads/stores work in the VM today (i.e. word-oriented, with some changes to the constraints as you described)? Or would it also involve reworking the implementation of memory in the VM so that it is element-oriented rather than word-oriented? Obviously I'm looking at this through a very narrow lens, so the importance of doing things like memory accesses in word-sized units might have far-reaching implications I'm not thinking about, but in an element-addressable world, it seems out of place to implement memory accesses in terms of words, particularly if

Anyway, I want to avoid derailing the conversation here just because I don't have a solid grasp of how some of these pieces fit together in the VM internals, so if I'm missing obvious stuff here (or, obvious if you know the internals well), feel free to ignore the above. I'm certainly curious about the relationship between the memory instructions and their implementation in the VM (and how that affects the constraints), but I imagine it's probably been discussed/documented somewhere already, and I just need to go do some reading.
The approach I described above would still keep the internal mechanics of memory word-oriented. So, basically, it would just look element-addressable.
The word-oriented design is done primarily for efficiency reasons. For the use cases we are targeting, it will be very common to read/write memory as full words rather than individual elements. For example, nodes in a Merkle tree are words, the output of the RPO hash function is a word (and inputs are 3 words), assets in the Miden rollup are words, etc. So, we frequently have a need to read or write a full word (or even two consecutive words), and I wanted to make this as efficient as possible. Consider
I think it's possible to switch to element-addressable memory and retain the efficiency gains of the current word-addressable memory. The memory chiplet would remain similar in that it would store words; when a single address is read/written, the whole word is read in/out (similar to how our

### Column layout sketch

Basically, I see the columns looking like (with a few left out to make the argument simpler):
where the columns are as follows:
### Constraints sketch

To ensure that a
To ensure the correctness of writes, we define flags
Then, when a single element is written (
We also ensure that when a word is read/written (
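The selector-and-flag pattern sketched in this section can be illustrated generically. A sketch of degree-2 one-hot flags derived from two binary selector columns - a common AIR idiom, not the actual Miden memory chiplet layout:

```rust
/// One-hot element flags computed from two binary selector columns.
/// `s1` and `s0` are assumed to take only the values 0 or 1; each flag
/// is a degree-2 product that selects exactly one of the 4 word elements.
fn element_flags(s1: u64, s0: u64) -> [u64; 4] {
    let f0 = (1 - s1) * (1 - s0); // selects element 0
    let f1 = (1 - s1) * s0; // selects element 1
    let f2 = s1 * (1 - s0); // selects element 2
    let f3 = s1 * s0; // selects element 3
    [f0, f1, f2, f3]
}
```

For any binary assignment of the selectors, exactly one flag evaluates to 1, which is what lets a single constraint pick out the element being written while the other three elements are constrained to carry over.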
### Conclusion

With element-addressable memory, we'd have 4GB of memory instead of 16GB, but I don't think that's an issue. Unless I missed something, I think we should move to element-addressable memory, as it solves the compiler's problems and still retains the efficiency of the current design. Note: we'd need a few changes to some instructions, where e.g.
Thank you! I'm not sure I fully understand the design yet - so, a couple of questions:
Right, I should have addressed those in the initial post.
That's right. To give a better intuition, I'll give an example of a few loads/stores and how the chiplet columns would be built:
The bus for element-wise read/writes only needs to include the element being read/written; the bus message for word read/writes is unchanged from the current version. Note that the instructions' API does not change (except that the address for

### Bus message for word read/writes

The message looks like

On the chiplet side,

On the stack side, the

### Bus message for element read/writes

The message looks like

Here,

On the chiplet side,

On the stack side,

### Bus (putting it all together)

Reusing the notation from the docs,
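The mechanics of a bus message can be sketched generically: both sides reduce the same payload to one field value using shared random challenges, and the bus only balances if the payloads match. This is a LogUp/bus-style sketch, not the exact Miden message layout, and plain wrapping u64 arithmetic stands in for field arithmetic:

```rust
/// Reduce a bus message payload to a single value with random challenges.
/// alphas[0] offsets the combination; alphas[1..] weight the payload terms.
fn reduce_message(alphas: &[u64], payload: &[u64]) -> u64 {
    payload
        .iter()
        .zip(alphas.iter().skip(1))
        .fold(alphas[0], |acc, (p, a)| acc.wrapping_add(p.wrapping_mul(*a)))
}
```

The point made above falls out of this shape: an element read/write message can carry just the single element (plus addressing metadata) rather than a full word, since requester and responder only need to agree on the reduced value.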
Thank you for this explanation! Makes things much more clear. To summarize my understanding:
A few more questions:
I figured out how to make it work with 15 columns total - which wouldn't require adding any new columns if we make the memory chiplet the second one, right after the hasher chiplet (instead of the bitwise).

### Memory chiplet columns

The columns would look like:
Notable changes are:
### Memory chiplet constraints
The

We will make use of the following variables:

We will also use

### First row constraints

This subsection describes the constraints for the first row of the memory chiplet.
When a word is written,

### Non-first row constraints
### Bus

The bus works just as described in the previous post, with the slight caveat that

### Bitwise chiplet

As mentioned by Bobbin offline, the bitwise chiplet is currently right after the hasher chiplet because they both require their rows to be aligned on boundaries of 8. The solution is to add 1 column
This one constraint encapsulates the behavior that the
When constructing the chiplet rows, the prover can then pad the right number of rows to reach the proper alignment of 8 at the start (setting the

### Conclusion

Please double-check everything, but assuming that this is right, I think we should definitely move to element-addressable memory, as we can do it without adding any columns. There might also be a simpler set of constraints (e.g. that merge some of the "first row" and "non-first row" constraints), but at least we know of one set of constraints that works.
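The 8-row boundary alignment discussed for the bitwise chiplet reduces to a small piece of arithmetic. A sketch (the function name is illustrative, not prover code):

```rust
/// Number of padding rows needed so the next chiplet starts on an
/// 8-row boundary.
fn padding_to_boundary_of_8(rows: usize) -> usize {
    (8 - rows % 8) % 8
}
```

The prover would emit this many padded rows (with the relevant selector/flag column set accordingly) before the next chiplet's first real row.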
Thank you for this proposal.
Overall, I think this looks promising and we should go for it once we are all on board.
Not a super detailed review of the proposal yet from my end. Overall, I think this approach should work - but a couple of questions/comments:

First, shouldn't

Second, I wonder if we could simplify the constraint description by defining a flag per value which indicates whether the value should be copied over from the previous row. We could call this flag something like

Basically, we are saying that

Using

This should be sufficient to define all non-bus constraints for the

We can also extend this to handle "new batch" situations. Assuming we have
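The "copy over from the previous row" flag described above can be sketched as a constraint evaluation. This is a generic sketch of the idea, not Miden's actual constraint code, with signed integers standing in for field elements:

```rust
/// Evaluation of a copy constraint of the form f * (v' - v), where f is a
/// binary "copy" flag, v the current-row value, and v' the next-row value.
/// A valid trace must make this evaluate to 0 on every row: when f = 1 the
/// value must carry over unchanged; when f = 0 the value is unconstrained.
fn copy_constraint(flag: i64, current: i64, next: i64) -> i64 {
    flag * (next - current)
}
```

Defining one such flag per value column is what lets all the "value is copied down" cases share a single constraint shape.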
### Response to Al
Correct.
Ah yes, I changed it to
Fixed.
And based on answer (2), the way to think about
Fixed.

### Response to Bobbin
My proposal assumes that we put the memory chiplet right after the hasher chiplet (in place of the current bitwise chiplet).
I'm not sure I follow how
Not sure I understand how this one is built either. If we are in a "new batch" situation, then

But to my previous comment, I'd always expect
Yes - I made quite a few mistakes in my logic here, and I'll describe a (hopefully) more correct approach which illustrates what I was thinking. But first, I think there may be an issue with

Specifically, I'm not sure we can rely solely on this flag to identify memory chiplet constraints. That is, every memory chiplet constraint still has to be multiplied by

So, I'd propose to redefine this flag as

Now, using this definition of

First, at a high level, the value of

Basically, this flag is

Using this definition of

But as you mentioned in your comment, we want this flag to be set to

Then, the constraint that we want to express becomes: when we are in the memory chiplet and

The degree

The above constraint should cover all
Do we also need a constraint to ensure that
Ah yes, those constraints work!
Ah yes we need that, I will fix it in the previous comment.
I think I found a way to extend your idea to the "first row for each batch" case. I will assume that we make the memory chiplet come right after the hasher chiplet again, in order to not have to add a new column to the VM for

I will slightly adjust your definition of

which is

### first row

For

That is, if

### non first row, new batch/context

For

That is,

### non first row, same batch/context

For

That is,
Nice! I didn't go through all the constraints in detail, but on a quick look this should work. One question: if we keep the order of chiplets the same as now, would the degrees of these constraints go up to
Double checked the constraints and all looks good.
I just noticed a flaw in the design: decomposing

The fix for this is to do a 32-bit range check on
### Repercussions

This means we'll need an extra column in the chiplet (since

### Appendix

This time I double-checked my intuition by proving (experimentally) that
In other words,
Below is the code to check these claims:

```rust
use vm_core::Felt;

fn main() {
    const FOUR: Felt = vm_core::Felt::new(4);
    let max = u32::MAX / 4;
    let two_power_32 = 2_u64.pow(32);

    for batch_idx in 0..max {
        // Word-aligned addresses (multiples of 4) divide down to a value
        // that stays in the u32 range...
        let x = Felt::from(batch_idx * 4);
        assert!((x / FOUR).as_int() < two_power_32, "x: {x}");
        // ...while all other addresses land outside the u32 range after
        // the field division.
        assert!(((x + 1_u32.into()) / FOUR).as_int() >= two_power_32, "x: {x}");
        assert!(((x + 2_u32.into()) / FOUR).as_int() >= two_power_32, "x: {x}");
        assert!(((x + 3_u32.into()) / FOUR).as_int() >= two_power_32, "x: {x}");
    }
}
```
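For readers without the `vm_core` crate handy, the same claim can be checked with plain integer arithmetic, assuming Miden's base field is the 64-bit Goldilocks field with p = 2^64 - 2^32 + 1 (an assumption of this sketch):

```rust
// Goldilocks prime: p = 2^64 - 2^32 + 1.
const P: u128 = (1u128 << 64) - (1u128 << 32) + 1;

// 4 * INV4 = 3p + 1, which is congruent to 1 mod p, so INV4 is the
// field inverse of 4.
const INV4: u128 = (3 * P + 1) / 4;

/// Field division by 4, i.e. multiplication by INV4 modulo p.
fn div4(x: u128) -> u128 {
    (x % P) * INV4 % P
}
```

For x = 4k the product telescopes to k exactly (so the result stays below 2^32 whenever k does), while for x = 4k + r with r in 1..=3 the result is k plus a multiple of INV4 reduced mod p, which is far above 2^32 - matching the exhaustive check in the snippet above.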
### Background
I'm currently finishing up portions of the code generator in the compiler for Miden IR, and something came up as I was working through how to translate memory operations in the IR (which is byte-addressable) to Miden Assembly (which as we all know is word-addressable).
In particular, I noticed that the `mem_load` and `mem_store` instructions, which load/store an individual field element, rather than a word, only work for the first element of the word at the given address. Up until now, I had been operating under the assumption that those instructions could load/store any element of the word, but had missed that the addresses are word-aligned, thus making it (at least at first glance) impossible to represent a load for, say, the 3rd element of a word. In short, this significantly reduces the usefulness and generality of these instructions.

### Proposal
I would like to suggest a change to these instructions that not only addresses this gap in functionality, but is backwards-compatible to boot.
The following are true of the current semantics:

- Addresses can be at most `2^32-1`, which is in the u32 range. The ops in question trap if the address is out of this range.
- If you were to `u32.split` the address operand, the high bits would be all zero, and the low bits would hold the address.

I'm proposing the following changes:
- These ops would `u32.split` the address operand.
- The high bits would encode an element offset in the range `0-3`.
Without this change, to perform the equivalent of `mem_load` on the 3rd element of a word, you'd need to issue a sequence of instructions like `padw, push.ADDR, mem_loadw, drop, drop, swap, drop`. With this change, you would use `push.ADDR, add.0x100000003, mem_load`. That's, by my count (if we assume that `mem_load` goes from 1 cycle to 2 cycles to account for the u32 split), an improvement of 6 cycles (10 vs 4) for a basic operation that is going to occur very frequently in programs with any local or global memory allocations.

The syntax in MASM could be extended for the version which accepts an immediate, such that the current syntax (e.g.
`mem_load.0x0`) is equivalent to specifying a constant offset of 0. This would be extended to accept a constant offset as well (e.g. `mem_load.0x0.3`).

### Potential Issue(s)
The primary issue I can think of, if I'm trying to find ways to break this, is that distinguishing between an out-of-bounds address and a valid address + offset pair becomes impossible. The case where an out-of-bounds address appears to be an in-bounds address with an invalid offset isn't really important to distinguish, as both will trap as invalid. However, it is possible, albeit unlikely, for an out-of-bounds address to appear as a valid in-bounds address + offset pair, in which case a load or store that would have previously trapped will now read/write some arbitrary location in memory.
We could defend against this by bit-packing a 24-bit CRC code in the high bits after the offset, which must be checked when the offset is non-zero, but whether it is worth imposing such a cost on every load/store really depends on how important it is that we catch such cases. Personally, I think it would be best to specify that the behavior is undefined in such cases, just like there is little that protects you on a more typical system from writing past the end of an array - it might trap, or it might not.
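One possible decoding for the address + offset pairs discussed above can be sketched via the value's high/low 32-bit limbs, in the spirit of the `u32.split` decomposition in the proposal. Which limb carries the offset is an assumption of this sketch, and the function name is hypothetical:

```rust
/// Decode a 64-bit operand into (offset, addr) using its high/low 32-bit
/// limbs. Offsets index into a 4-element word, so anything above 3 marks
/// the operand as invalid (e.g. an out-of-bounds address in disguise).
fn decode_addr(value: u64) -> Option<(u32, u32)> {
    let offset = (value >> 32) as u32;
    let addr = value as u32;
    if offset <= 3 { Some((offset, addr)) } else { None }
}
```

Note that this check only rejects offsets above 3; as the paragraph above points out, a genuinely out-of-bounds value whose high limb happens to be 0-3 would still decode "successfully", which is the case the CRC idea (or declaring the behavior undefined) is meant to address.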
### Alternatives
The proposed changes are essential in my opinion, because without them, it is exceedingly awkward to treat memory as word-addressable. The primary alternative is to instead treat memory as felt-addressable. There are a couple of problems with this:
--
I should note that this is largely an issue which affects compilers, where fewer (or even no) assumptions can be made about where a given value is located in memory, so a consistent memory model is important. That said, even with hand-written assembly, you'd find yourself in a situation where you are either forced to waste most of a word, or required to emit expensive instruction sequences to read out a word and then extract what you need from it.