Arm64 multicore support #10972
Conversation
My daily driver is arm64, so this PR makes me very happy ^_^
@EduardoRFS, you need to look into this...
@ctk21 I didn't know you were working on this, thank you so much. I was late to the party; this PR is quite close to what I had, so we can just follow here. I will be reviewing it in the near future. About the
Yeah, I was having a hard time with this. Is it documented somewhere how all the magic works?
We need PR #10943 for correct compilation of the memory model. Then,
See Table 5 of https://kcsrk.info/papers/pldi18-memory.pdf. Atomic stores are compiled as external calls, and hence nothing needs to be done there.
There's currently no documentation on this except for the comments in
> Except initializing stores

Most stores are initializing stores, so it makes a big difference in code size.
Indeed. The barrier is not needed for initialising stores.
For a non-initializing non-atomic store, did you not decide for
Yes, I believe so. We need a release store for the publication safety of initialising stores.
I don't know enough about the translation to the C11 memory model. Perhaps @stedolan can say more.
#10943 is merged now. Thanks for merging, @gasche. While we are reviewing Arm64, I'd like to see the codegen side of the memory model discussed and implemented in this PR as well. The codegen changes are small and IINM contained to the emission of appropriate instructions for non-atomic assignment (but not initialising). It would also be great if we could have eyes on
I don't think

```ocaml
let r1 = ref [||]
let t1 () = r1 := [| "Hello, world" |] (* thread 1 *)
let t2 () = print_endline (!r1.(0)) (* thread 2 *)
```

It would be bad (crash!) if

Lines 2212 to 2225 in ebb9f0d
If you know that all potentially-publishing stores of blocks necessarily go through
Sorry for forcing people to work more on this, but I'd like to ask for an explanation of "publication safety" that non-experts like me can follow. (I think this is interesting in particular because arm64 is the first non-TSO backend to be merged, and the discussions here may inform other backends in the future.)

caml_modify first performs an "acquire fence" and then a "release store/write" on the modified location. There is a comment in memory.c that explains why this guarantees memory safety, but I cannot check that I agree with what it says, or with what's going on here.

In my naive understanding of memory models, release stores synchronize with acquire loads (when the load sees the result of the store), and this guarantees that any writes (atomic or not) made before the release store will also be seen after the acquire load. (The C memory model also has "consume" loads with weaker guarantees.) But in the example of @kayceesrk above, the read of the reference

(Is this related to the fence? I'm not fully clear on what these fences are doing, but my current understanding is that they ensure that if a thread sees the write to
Our messages crossed. I had guessed that this relies on Arm's dependency ordering (cf. the consume ordering you mentioned, which was never implemented as intended in C compilers). But this opens more questions, for which I am preparing an issue.
The key point here is that there is an address dependency between the two loads on

Here is a broken message-passing example constructed only using non-atomic locations, for comparison:

```ocaml
let msg = ref 0 and flag = ref None

let t1 () =
  msg := 1;
  flag := Some 42

let t2 () =
  let rf = !flag in
  let rm = !msg in
  assert (not (rf = Some 42 && rm = 0)) (* may fail *)
```

The write to

[1] Limits on ordering, https://developer.arm.com/documentation/102376/0100/Normal-memory
I must admit that I'm not familiar with the axiomatic memory model from the paper (the operational model is simple and nice, but it doesn't say anything here, as we are mixing atomic and non-atomic accesses on the same location); I haven't had the brain surgery yet that makes people comfortable with axiomatic memory models.

If I understand correctly, your reply basically means that on Arm, non-atomic loads behave like "consume" loads when they synchronize with "release" stores? (I mentioned that compiling all (non-atomic) reads of mutable locations to acquire loads would be safe; in fact, I guess that having non-atomic reads of mutable locations act as "consume" loads suffices for this problem of block value initialization, as all potentially-problematic loads are dependent on this load.)
I believe so. In fact, I now notice that the behaviour in this example is exactly the "publisher-subscriber situations with pointer-mediated publication" described in [1]. But the document also says compilers promote consume to acquire, and discourages the use of consume ordering. Hence, I think it may be better to reason about the mixed-mode behaviours in terms of the hardware model and not C++11.

[1] https://en.cppreference.com/w/cpp/atomic/memory_order#Release-Consume_ordering
This conversation has been going above my head for a while already, but:
The discussion is all about OCaml non-atomic loads and stores, as far as I can see. (Actually, static typing prevents mixing atomic and non-atomic accesses on the same location, again as far as I can see.) So the models from the paper should tell you which behaviors are allowed. Then all you need is @maranget-level understanding of the ARMv8 memory model to check that the proposed implementation doesn't allow more behaviors than that.
I understand the question as mixing the OCaml and C11 models. As far as I understand, in the specific scenario one wishes for ordering from one read to another when the second read's address depends on the first read's value, a.k.a. an address dependency. The difficulty originates from coding the read-to-read sequence in C. Armv8 leaves address dependencies alone, i.e. no fence is needed here, nor a load-acquire for performing the first read. But in C?
In conclusion, the cast to volatile on pointers to aligned values does not incur much penalty, if any. But well, this is not C11!

(*) Consume is there to enforce address dependency as an ordering, for DEC Alpha, which does not provide it by default.
Again, this goes above my head, but: there is only one piece of C code involved in the discussion as far as I can see, namely
Agree with Xavier that the only C code here is in
The confusion is partly my fault; I had a draft pointing out two issues with this dependency ordering, including the one with C code. I was proposing that the
Thank you for the bits of missing context. I'd like to recenter the memory model discussion on this PR. If I understand correctly, here is the proposed implementation of the Multicore OCaml memory model on AArch64:
Note: the last line is the code produced by GCC 11.2 for the "acquire fence; release store" sequence in

Is everyone comfortable with this implementation? Especially the use of a plain

Please try to give clear answers without hidden context.

[ Edited to use
I read the assembly code (runtime/arm64.S) up to the Fibers part. The code is actually quite easy to read, in part because of the judicious use of GAS macros. Congratulations!
I did not spot any issues in this part of runtime/arm64.S, just suggestions for clarifications and minor simplifications in proc.ml and emit.mlp.
```
1:  RESTORE_ALL_REGS
    /* Free stack space and return to caller */
    ldp     x29, x30, [sp], 16
    add     sp, sp, 16      /* pop argument */
```
I assume this SP adjustment is needed so that the backtrace is properly recorded.
Yes, we want the stack to be well formed when we call caml_raise_exn, in case a backtrace is needed.
asmcomp/arm64/proc.ml (outdated)

```
    (* x20-x28, d8-d15 preserved *)
    Array.of_list (List.map phys_reg
-     [0;1;2;3;4;5;6;7;8;9;10;11;12;13;14;15;
+     [0;1;2;3;4;5;6;7;8;9;10;11;12;13;14;15;16;
```
With a minor change to arm64.S, we might be able to avoid destroying x19, in which case this change can be reverted.
So using the frame pointer to avoid destroying x19 is something I think can work, by changing the emit for Iextcall in the case of non-allocating C calls. Let me try that.
e7b3839 implements Iextcall for non-allocating C calls using the frame pointer register to hold the OCaml stack. If we end up having frame pointer support on the OCaml frames in arm64, then I believe we could leave x29 alone and restore the OCaml stack using x29 and env.stack_offset. So I think this doesn't back us into a corner there.

Overall I like this change; it gives back x19 to user code.
runtime/memory.c (outdated)

```
@@ -234,7 +234,9 @@ CAMLexport int caml_atomic_cas_field (
   } else {
     /* need a real CAS */
     atomic_value* p = &Op_atomic_val(obj)[field];
-    if (atomic_compare_exchange_strong(p, &oldval, newval)) {
+    int cas_ret = atomic_compare_exchange_strong(p, &oldval, newval);
+    atomic_thread_fence(memory_order_release); /* generates `dmb ish` */
```
Again limiting my review to the memory model part, here is a minor nit: these C functions are not specific to Arm, and a lot more is needed to explain the hybrid scheme they now implement, so these comments are out of context. It is important to explain why these fences are there, but perhaps it is better to have that as part of the upcoming memory model documentation?
Would you be satisfied if the comment said "generates dmb ish on Arm64"? Of course, the comment by itself is not illustrative unless accompanied by memory model documentation, but it will be correct. Once the documentation is available, we can update the comment to point to it.
Whatever works for you; it was just a suggestion. It's fine to clarify it later.
I have reviewed the arm64.S file, and it looks correct to me. I am also fairly confident about the coverage of the tests in testsuite/tests/effects and the effect handlers tests in testsuite/tests/callback. They were written to comprehensively test each of the different transitions between C and OCaml, and fibers. We have also reviewed the memory model aspects of the Arm64 compilation quite thoroughly here. Based on these, I am approving this PR.

The DWARF stack unwinding work should be done in a separate PR, as the original PR message says. Separately, there is also work to document the low-level implementation details of fibers.

@xavierleroy I'm wondering what else we need to get this PR across the line? Once the PR is at a stage where it doesn't need more work, @ctk21 promised to make the commit history sensible and add a CHANGES entry.

I agree this PR is in good shape and can be merged. I'll be happy if @ctk21 can clean up the history a bit.
With the memory-model-related changes I was not capable of doing a proper review, so I made a superficial review of arm64.S, and it looks okay. There are a lot of small improvements to make (mostly shaving instructions away), but in general it looks sound and quite similar to amd64.S and to what I was writing.

Thank you so much @ctk21
Many thanks for the reviews.
…loc Iextcall; s/max_stack_size/max_frame_size/ in {arm64/amd64}/emit.mlp; refactor out preproc_stack_check to emitaux.ml to share between {arm64,amd64}/emit.mlp; handle Ladjust_trap_depth in preproc_stack_check; exhaustive match in preproc_stack_check
…ly handle ALLOC_PTR & TRAP_PTR in {SAVE,RESTORE}_ALL_REGS
ARM64: remove dead commented out code in caml_c_call
Make sure atomic loads use the trivial addressing mode; the `ldar` instruction does not support fancy addressing.

ARM64 assembler generation:
- Non-initializing stores can use "str" and need a barrier only at Word_int and Word_val types. (For smaller or bigger types, the guarantees of the Multicore OCaml memory model do not apply anyway.)
- Atomic loads are performed only at Word_int and Word_val types, and do not need a barrier in addition to the "ldar" instruction.

Add casts to ensure proper typing of printf format.

Add atomic_thread_fence(memory_order_release) to ensure 'dmb ish' for caml_atomic_exchange, caml_atomic_cas_field, caml_atomic_cas and caml_atomic_fetch_add.

Co-authored-by: Xavier Leroy <[email protected]>
Remove unnecessary zeroing of backtrace_pos in emit for Raise_regular; fixes for comments and minor tweaks; CFI_OFFSET fixes; use clearer ldr with Cstack_prev in caml_start_program; utilize the frame pointer to hold the OCaml stack over non-allocating Iextcall, returning x19 as a user-saved register.

Co-authored-by: Xavier Leroy <[email protected]>
Force-pushed from 4b1728c to e4943bf.
I have updated the Changes entry and cleaned up the commits (if you squint, it might represent the PR history).
Looks ready for merging! Thanks a lot @ctk21 and all the reviewers.
This PR implements the assembler, proc and emit changes needed to get ARM64 working. I have tested it using the compiler testsuite on MacOS/M1 and on Linux/Graviton2.
What this PR intends to do:

- Iload and Istore changes to implement the OCaml memory model in ARMv8.

What this PR doesn't do:

I have not spent quality time plumbing through the memory model requirements; I've limited the exercise to getting the assembler up and the testsuite going. My thinking was to handle the memory model in a follow-on.

I'm keen to get feedback; the assembler is very much "get it to run". There may be beneficial refactorings or ARM64 idioms (or <gasp> bugs not exercised by our testsuite) that other eyes will see.
[edited to reflect the PR now incorporates changes in emit for the memory model]