-
Notifications
You must be signed in to change notification settings - Fork 260
Fixing signal instability issues in Jenkins tests #281
Fixing signal instability issues in Jenkins tests #281
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 20 of 21 files at r1, 4 of 4 files at r2, 1 of 1 files at r4, 1 of 1 files at r7.
Reviewable status: all files reviewed, 14 unresolved discussions (waiting on @chiache, @dimakuv, @rainfld, and @mkow)
LibOS/shim/src/bookkeep/shim_signal.c, line 476 at r7 (raw file):
shim_tcb_t * tcb = SHIM_GET_TLS(); if (!tcb || !tcb->tp || !cur_thread_is_alive()) {
Is there a point where this check could be hoisted to be applied to all upcalls in one place?
LibOS/shim/src/ipc/shim_ipc_nsimpl.h, line 755 at r7 (raw file):
* @need_locate: Need the location information of the leader. */ static void __discover_ns (bool block, bool need_connect, bool need_locate)
One higher-level question: Can you give an example of a case where it is ok not to block for discovery? It seems like this is pretty critical information in most cases, and you would end up having an error if you don't block to learn who is the leader.
LibOS/shim/src/ipc/shim_ipc_nsimpl.h, line 786 at r7 (raw file):
unlock(cur_process.lock); if (!NS_SEND(findns)(block)) {
Am I correct that the behavior here is that you are always going to send the message to find the leader, you just might not wait for a response?
This might bear a comment.
LibOS/shim/src/ipc/shim_ipc_nsimpl.h, line 787 at r7 (raw file):
if (!NS_SEND(findns)(block)) { wait_for_ipc = !block;
Is this being inverted because the line above was blocking?
Pal/src/host/Linux/db_exception.c, line 194 at r7 (raw file):
uintptr_t rip = uc->uc_mcontext.gregs[REG_RIP]; if (ADDR_IN_PAL(rip)) {
Can you add a comment here about the expected behavior I think this is basically saying that if there is a fault in the PAL to just kill the thread?
Pal/src/host/Linux/db_exception.c, line 216 at r7 (raw file):
PAL_TCB * tcb = get_tcb(); assert(tcb); tcb->pending_event = event_num;
Does this mean we can only have one pending event? Can we set an upper bound on the number of incoming events with some reasonable confidence?
Pal/src/host/Linux/db_exception.c, line 278 at r7 (raw file):
.handler = _DkTerminateSighandler }, [PAL_EVENT_RESUME] = { .signum = { SIGCONT, 0 }, .handler = _DkTerminateSighandler },
Why Terminate?
Pal/src/host/Linux/db_main.c, line 242 at r7 (raw file):
tcb->self = tcb; tcb->handle = first_thread; tcb->alt_stack = alt_stack;
Depending on how this is used, should this also be set to alt_stack - sizeof(PAL_TCB)?
Pal/src/host/Linux/db_misc.c, line 219 at r7 (raw file):
ret = INLINE_SYSCALL(arch_prctl, 2, ARCH_GET_FS, &ret_addr); } else if (reg == PAL_SEGMENT_GS) { // The GS segment is used for the internal TCB of PAL
How confident are we that this won't interfere with applications we want to run?
Pal/src/host/Linux/db_object.c, line 135 at r7 (raw file):
if (count == 1) { if (!handleArray[0])
I don't understand this condition. Is this valid? What is the case where you would call with count =1 and a null handle in entry 0? Perhaps a comment here (or in the function-level comment)?
Pal/src/host/Linux/db_threading.c, line 104 at r7 (raw file):
tcb->self = tcb; tcb->handle = hdl; tcb->alt_stack = alt_stack;
Same comment - 2x check stack math wrt tcb.
Pal/src/host/Linux/Makefile.am, line 11 at r7 (raw file):
CFLAGS = -Wall -fPIC -O2 -std=gnu99 -fgnu89-inline -U_FORTIFY_SOURCE \ -fno-omit-frame-pointer \
How confident are we that we really don't need the frame pointer? Is it really just for this exception handling case?
Pal/src/host/Linux/pal_host.h, line 191 at r7 (raw file):
static __attribute__((unused)) int _DkInternalLock (struct mutex_handle * lock)
Why are we keeping this code if it is unused?
Pal/src/host/Linux-SGX/db_object.c, line 113 at r7 (raw file):
if (count == 1) { if (!handleArray[0])
Same comment as above.
Overall, this is great, and I'm happy to see these fixes. I have some questions about a few of the finer details or code that probably merits a comment or a 2nd look. I do hope we can land these changes soon and move forward. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 25 of 26 files reviewed, 14 unresolved discussions (waiting on @donporter, @dimakuv, @rainfld, and @mkow)
LibOS/shim/src/bookkeep/shim_signal.c, line 476 at r7 (raw file):
Previously, donporter (Don Porter) wrote…
Is there a point where this check could be hoisted to be applied to all upcalls in one place?
You are right. The same check can be made inside deliver_signal() and the semantics will be mostly the same. Allt we need is to stop signals from being delivered to the user program.
I've made the change. Please review.
LibOS/shim/src/ipc/shim_ipc_nsimpl.h, line 755 at r7 (raw file):
Previously, donporter (Don Porter) wrote…
One higher-level question: Can you give an example of a case where it is ok not to block for discovery? It seems like this is pretty critical information in most cases, and you would end up having an error if you don't block to learn who is the leader.
A possible scenario is that a child process asks its parent for the leader and the parent has to ask the grandparent for the information. In this corner case, __discover_ns() is called by the IPC helper thread. Because IPC helper thread cannot be blocking, we simply set up a callback to deliver the answer when it arrives later.
LibOS/shim/src/ipc/shim_ipc_nsimpl.h, line 786 at r7 (raw file):
Previously, donporter (Don Porter) wrote…
Am I correct that the behavior here is that you are always going to send the message to find the leader, you just might not wait for a response?
This might bear a comment.
Yes, exactly. If the discovery is non-blocking, the IPC message is sent but does not wait for a response.
LibOS/shim/src/ipc/shim_ipc_nsimpl.h, line 787 at r7 (raw file):
Previously, donporter (Don Porter) wrote…
Is this being inverted because the line above was blocking?
Maybe the variable name is confusing. What I mean is, if the IPC call to find the leader is a blocking call, then there is no need to wait for IPC response.
Pal/src/host/Linux/db_exception.c, line 194 at r7 (raw file):
Previously, donporter (Don Porter) wrote…
Can you add a comment here about the expected behavior I think this is basically saying that if there is a fault in the PAL to just kill the thread?
Yes, and this is only limited to segfaults, illegal instructions, and div-by-zero. We are trusting that the PAL should not have these exceptions. The assumption is that the LibOS will do the check and guarding to prevent these exceptions from happening in the middle of a PAL call.
Pal/src/host/Linux/db_exception.c, line 216 at r7 (raw file):
Previously, donporter (Don Porter) wrote…
Does this mean we can only have one pending event? Can we set an upper bound on the number of incoming events with some reasonable confidence?
That's true. I am not exactly sure how many slots we should preserve for pending events (because it largely depends on workloads). Right now it seems like one is enough for our Jenkins tests, but maybe sometimes multiple events can occur within one PAL call. Another idea is to use a linked list.
Pal/src/host/Linux/db_exception.c, line 278 at r7 (raw file):
Previously, donporter (Don Porter) wrote…
Why Terminate?
Because as the change made in _DkGenericSighandler(), the whole process dies if the exception arrives in the middle of a PAL call. In this sense, a resume signal is actually more close to terminate or suspend signals.
Pal/src/host/Linux/db_main.c, line 242 at r7 (raw file):
Previously, donporter (Don Porter) wrote…
Depending on how this is used, should this also be set to alt_stack - sizeof(PAL_TCB)?
The confusion is caused by the variable name. Here, alt_stack is actually the bottom of the alternative stack. What's assigned to sigaltstack() is actually tcb
, with the size being tcb - tcb->alt_stack
.
Pal/src/host/Linux/db_misc.c, line 219 at r7 (raw file):
Previously, donporter (Don Porter) wrote…
How confident are we that this won't interfere with applications we want to run?
Very confident. I've never seen any application that uses the GS register.
Pal/src/host/Linux/db_object.c, line 135 at r7 (raw file):
Previously, donporter (Don Porter) wrote…
I don't understand this condition. Is this valid? What is the case where you would call with count =1 and a null handle in entry 0? Perhaps a comment here (or in the function-level comment)?
Yes, it is a valid condition in the PAL ABI, unfortunately. We do allow the elements in handle array to be NULL. The possibility is that the elements were once some handles but later removed (but the caller is too lazy to reconstruct the array). I've seen this bug causing segfaults in some of the Jenkins tests.
I will add a comment later.
Pal/src/host/Linux/db_threading.c, line 104 at r7 (raw file):
Previously, donporter (Don Porter) wrote…
Same comment - 2x check stack math wrt tcb.
Same.
Pal/src/host/Linux/Makefile.am, line 11 at r7 (raw file):
Previously, donporter (Don Porter) wrote…
How confident are we that we really don't need the frame pointer? Is it really just for this exception handling case?
Very confident. The whole reason of preserving the frame pointer is that we can trace the top frame for PAL. Now we don't need that anymore.
Pal/src/host/Linux/pal_host.h, line 191 at r7 (raw file):
Previously, donporter (Don Porter) wrote…
Why are we keeping this code if it is unused?
Just in case it is not used in some source files, GCC wouldn't raise a warning. The function is still very much used in many source files.
Pal/src/host/Linux-SGX/db_object.c, line 113 at r7 (raw file):
Previously, donporter (Don Porter) wrote…
Same comment as above.
Same.
Will add a comment later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 25 of 26 files reviewed, 14 unresolved discussions (waiting on @donporter, @dimakuv, @rainfld, and @mkow)
LibOS/shim/src/ipc/shim_ipc_nsimpl.h, line 787 at r7 (raw file):
Previously, chiache (Chia-Che Tsai) wrote…
Maybe the variable name is confusing. What I mean is, if the IPC call to find the leader is a blocking call, then there is no need to wait for IPC response.
Still confusing, I guess. What I mean is that the discovery has completed when a blocking IPC call finishes and succeeds. Otherwise, 'wait_for_ipc' is set to true and we can't expect an answer in this corner case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 23 of 26 files reviewed, 14 unresolved discussions (waiting on @donporter, @dimakuv, @rainfld, and @mkow)
LibOS/shim/src/ipc/shim_ipc_nsimpl.h, line 786 at r7 (raw file):
Previously, chiache (Chia-Che Tsai) wrote…
Yes, exactly. If the discovery is non-blocking, the IPC message is sent but does not wait for a response.
Done.
LibOS/shim/src/ipc/shim_ipc_nsimpl.h, line 787 at r7 (raw file):
Previously, chiache (Chia-Che Tsai) wrote…
Still confusing, I guess. What I mean is that the discovery has completed when a blocking IPC call finishes and succeeds. Otherwise, 'wait_for_ipc' is set to true and we can't expect an answer in this corner case.
Done.
Pal/src/host/Linux/db_exception.c, line 194 at r7 (raw file):
Previously, chiache (Chia-Che Tsai) wrote…
Yes, and this is only limited to segfaults, illegal instructions, and div-by-zero. We are trusting that the PAL should not have these exceptions. The assumption is that the LibOS will do the check and guarding to prevent these exceptions from happening in the middle of a PAL call.
Done.
Test this please. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 3 files at r8.
Reviewable status: 25 of 26 files reviewed, 7 unresolved discussions (waiting on @donporter, @chiache, @dimakuv, @rainfld, and @mkow)
LibOS/shim/src/ipc/shim_ipc_nsimpl.h, line 755 at r7 (raw file):
Previously, chiache (Chia-Che Tsai) wrote…
A possible scenario is that a child process asks its parent for the leader and the parent has to ask the grandparent for the information. In this corner case, __discover_ns() is called by the IPC helper thread. Because IPC helper thread cannot be blocking, we simply set up a callback to deliver the answer when it arrives later.
Got it. In this case, would it be wise to 1) add an assert that block is true if we aren't the IPC helper thread, and 2) to make sure the caller checks some sort of return value?
Or are you saying that there is code in this function that sets up the callback?
Pal/src/host/Linux/db_exception.c, line 216 at r7 (raw file):
Previously, chiache (Chia-Che Tsai) wrote…
That's true. I am not exactly sure how many slots we should preserve for pending events (because it largely depends on workloads). Right now it seems like one is enough for our Jenkins tests, but maybe sometimes multiple events can occur within one PAL call. Another idea is to use a linked list.
At a minimum, I might assert that the value is 0 (or whatever value means no pending signal), so that you get a clear error if more than one pending event arrives.
My instinct would be to just pre-allocate a reasonably large array and not over-engineer things, but make sure that, if we are wrong, there is a clear failure message.
Pal/src/host/Linux/db_main.c, line 242 at r7 (raw file):
Previously, chiache (Chia-Che Tsai) wrote…
The confusion is caused by the variable name. Here, alt_stack is actually the bottom of the alternative stack. What's assigned to sigaltstack() is actually
tcb
, with the size beingtcb - tcb->alt_stack
.
Might add a comment to reassure the reader of the code that the right thing is done elsewhere (and maybe even a pointer to where that happens).
Pal/src/host/Linux/pal_host.h, line 191 at r7 (raw file):
Previously, chiache (Chia-Che Tsai) wrote…
Just in case it is not used in some source files, GCC wouldn't raise a warning. The function is still very much used in many source files.
I see, because it is in a header and not a .c file? Seems like relocating it might be preferable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 20 of 27 files reviewed, 7 unresolved discussions (waiting on @donporter, @chiache, @dimakuv, @rainfld, and @mkow)
Pal/src/host/Linux/db_main.c, line 242 at r7 (raw file):
Previously, donporter (Don Porter) wrote…
Might add a comment to reassure the reader of the code that the right thing is done elsewhere (and maybe even a pointer to where that happens).
Done.
Pal/src/host/Linux/db_object.c, line 135 at r7 (raw file):
Previously, chiache (Chia-Che Tsai) wrote…
Yes, it is a valid condition in the PAL ABI, unfortunately. We do allow the elements in handle array to be NULL. The possibility is that the elements were once some handles but later removed (but the caller is too lazy to reconstruct the array). I've seen this bug causing segfaults in some of the Jenkins tests.
I will add a comment later.
Done.
Pal/src/host/Linux-SGX/db_object.c, line 113 at r7 (raw file):
Previously, chiache (Chia-Che Tsai) wrote…
Same.
Will add a comment later.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 20 of 27 files reviewed, 11 unresolved discussions (waiting on @donporter, @chiache, @dimakuv, @rainfld, and @mkow)
LibOS/shim/src/shim_init.c, line 779 at r8 (raw file):
sizeof(struct newproc_response), &res, NULL)) return -PAL_ERRNO;
Signal delivery is not re-enabled on this path. I guess, given that the whole process is dying anyway, we can ignore re-enabling it.
LibOS/shim/src/shim_init.c, line 811 at r8 (raw file):
Quoted 6 lines of code…
if (cur_tcb->context.sp) restore_context(&cur_tcb->context); if (cur_thread->exec) execute_elf_object(cur_thread->exec, argc, argp, nauxv, auxp);
Here, restore_context()
will bump context.preempt
, and execute_elf_object()
will also bump context.preempt
. Is it possible that both IF statements are true, and we incidentally bump the preempt counter twice? Will this break the signal-delivery semantics?
Pal/src/db_object.c, line 92 at r8 (raw file):
for (int i = 0 ; i < count ; i++) // We modify the caller's handleArray? if (!handleArray[i] && UNKNOWN_HANDLE(handleArray[i]))
Remove bang symbol. Must be if (handleArray[i] && UNKNOWN_HANDLE(handleArray[i]))
Pal/src/host/Linux/db_exception.c, line 278 at r7 (raw file):
Previously, chiache (Chia-Che Tsai) wrote…
Because as the change made in _DkGenericSighandler(), the whole process dies if the exception arrives in the middle of a PAL call. In this sense, a resume signal is actually more close to terminate or suspend signals.
Isn't it wrong? _DkTerminateSighandler
performs _DkThreadExit
if there is no upcall for RESUME event (SIGCONT signal). Doesn't it mean that the thread will be exited if it didn't register SIGCONT handler (which is a clearly incorrect behavior)?
I think it makes sense to merge _DkTerminateSighandler
and _DkGenericSighandler
into one function, with a couple additional IFs and switches.
Pal/src/host/Linux/db_exception.c, line 167 at r8 (raw file):
} #if BLOCK_SIGFAULT == 1
BTW, can we remove this macro and code? It is not used.
Pal/src/host/Linux/db_exception.c, line 175 at r8 (raw file):
struct ucontext * uc) { #if BLOCK_SIGFUALT == 1
BTW, can we remove this macro and code? It is not used.
Also, the macro name is incorrectly spelled here.
Pal/src/host/Linux/db_exception.c, line 206 at r8 (raw file):
} printf("*** An unexpected %s occured inside PAL. Exiting the thread. "
Can we safely use printf here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 19 of 27 files reviewed, 11 unresolved discussions (waiting on @donporter, @chiache, @dimakuv, @rainfld, and @mkow)
LibOS/shim/src/shim_init.c, line 779 at r8 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
Signal delivery is not re-enabled on this path. I guess, given that the whole process is dying anyway, we can ignore re-enabling it.
That's true. Thanks for pointing it out. Should I assume this is a resolved issue?
LibOS/shim/src/shim_init.c, line 811 at r8 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
if (cur_tcb->context.sp) restore_context(&cur_tcb->context); if (cur_thread->exec) execute_elf_object(cur_thread->exec, argc, argp, nauxv, auxp);
Here,
restore_context()
will bumpcontext.preempt
, andexecute_elf_object()
will also bumpcontext.preempt
. Is it possible that both IF statements are true, and we incidentally bump the preempt counter twice? Will this break the signal-delivery semantics?
I believe it's impossible because restore_context()
should never return. But I can add an assertion here to make sure that will never happen.
Pal/src/db_object.c, line 92 at r8 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
Remove bang symbol. Must be
if (handleArray[i] && UNKNOWN_HANDLE(handleArray[i]))
Good catch! Will fix this.
Pal/src/host/Linux/db_exception.c, line 206 at r8 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
Can we safely use printf here?
It should be safe because there is an alternative stack (printf()
in PAL uses an on-stack buffer).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 19 of 27 files reviewed, 11 unresolved discussions (waiting on @donporter, @chiache, @dimakuv, @rainfld, and @mkow)
Pal/src/db_object.c, line 92 at r8 (raw file):
Previously, chiache (Chia-Che Tsai) wrote…
Good catch! Will fix this.
Done.
Pal/src/host/Linux/db_exception.c, line 167 at r8 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
BTW, can we remove this macro and code? It is not used.
Done.
Pal/src/host/Linux/db_exception.c, line 175 at r8 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
BTW, can we remove this macro and code? It is not used.
Also, the macro name is incorrectly spelled here.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 19 of 27 files reviewed, 17 unresolved discussions (waiting on @donporter, @chiache, @dimakuv, @rainfld, and @mkow)
a discussion (no related file):
SGX PAL has a same way for handling signals with Linux. We might also need to update it. But that could be a separate PR
LibOS/shim/include/shim_internal.h, line 455 at r10 (raw file):
tcb->context.preempt--; //debug("enable preempt: %d\n", tcb->context.preempt & ~SIGNAL_DELAYED); }
The two functions __disable_preempt and __enable_preempt needs to be paired along the path. Currently the code looks fine but the pairs can be easily broken if some modifications are made in future. We might need a way to track if it's paired or not. Maybe we could use something like lock rather than only keeping the count numbers.
LibOS/shim/src/bookkeep/shim_signal.c, line 254 at r10 (raw file):
shim_tcb_t * tcb = SHIM_GET_TLS(); if (tcb && tcb->test_range.cont_addr && arg
Any insights like when tcb could potentially be NULL? I found in a few other locations, "tcb" is not checked after "SHIM_GET_TLS".
LibOS/shim/src/ipc/shim_ipc_helper.c, line 554 at r10 (raw file):
listp_for_each_entry_safe(pobj, n, &pobj_list, list) { __get_ipc_port(pobj); if (pobj->pal_handle && __del_ipc_port(pobj, type))
need to check "pobj->deleted" too? Or it won't happened at runtime?
LibOS/shim/src/ipc/shim_ipc_nsimpl.h, line 800 at r10 (raw file):
goto out; // If all other ways failed, the current process become the leader
typo: become ->becomes
Pal/src/host/Linux/db_threading.c, line 68 at r10 (raw file):
ss.ss_flags = 0; ss.ss_size = (void *) tcb - tcb->alt_stack;
According to _DkThreadCreate, tcb is below alt_stack, ss.ss_sp should be alt_stack not tcb, and ss.ss_size should be "tcb->alt_stack - (uint64_t*) tcb)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 19 of 27 files reviewed, 17 unresolved discussions (waiting on @donporter, @chiache, @dimakuv, @rainfld, and @mkow)
a discussion (no related file):
Previously, rainfld (Li Lei) wrote…
SGX PAL has a same way for handling signals with Linux. We might also need to update it. But that could be a separate PR
Thanks, that's a good point. I do plan to make another PR for SGX PAL. My observation is that the SGX PAL has less problem with host-raised exceptions because of the unique signal-handling flow of SGX.
Pal/src/host/Linux/db_threading.c, line 68 at r10 (raw file):
Previously, rainfld (Li Lei) wrote…
According to _DkThreadCreate, tcb is below alt_stack, ss.ss_sp should be alt_stack not tcb, and ss.ss_size should be "tcb->alt_stack - (uint64_t*) tcb)
Thanks! That's a very good point. Please see the latest code. The problem should be fixed now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 4 of 7 files at r9, 2 of 2 files at r10, 2 of 2 files at r11.
Reviewable status: all files reviewed, 13 unresolved discussions (waiting on @chiache, @dimakuv, @rainfld, and @mkow)
LibOS/shim/include/shim_internal.h, line 455 at r10 (raw file):
Previously, rainfld (Li Lei) wrote…
The two functions __disable_preempt and __enable_preempt needs to be paired along the path. Currently the code looks fine but the pairs can be easily broken if some modifications are made in future. We might need a way to track if it's paired or not. Maybe we could use something like lock rather than only keeping the count numbers.
Or we could have points, like at the end of a system call, where we assert that preemption is reenabled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 2 files at r12.
Reviewable status: all files reviewed, 17 unresolved discussions (waiting on @donporter, @chiache, @rainfld, and @mkow)
LibOS/shim/src/ipc/shim_ipc_helper.c, line 388 at r12 (raw file):
/* if the port still have other usage, we will not remove the port */ if (port->info.type & ~(type|IPC_PORT_IFPOLL|IPC_PORT_KEEPALIVE)) {
Did you move the two lines above because of this condition? I.e., there was a bug with incorrectly setting port->deleted
even if port has other usages? (Just trying to understand the code.)
LibOS/shim/src/ipc/shim_ipc_helper.c, line 414 at r12 (raw file):
wmb(); /* commit the state to the memory */ // Need to check if there is any pending messages on the port, which means
Typo "is any" -> "are any"
LibOS/shim/src/ipc/shim_ipc_helper.c, line 467 at r12 (raw file):
if (port->info.vmid == vmid) { __get_ipc_port(port); // Bump the refcount to prevent freeing if (!port->deleted && __del_ipc_port(port, type))
It would be easier to read if !port->deleted
is moved to above IF:
if (port->info.vmid == vmid && !port->deleted) ...
LibOS/shim/src/ipc/shim_ipc_helper.c, line 497 at r12 (raw file):
unlock(ipc_helper_lock); if (nfini) {
This IF doesn't make sense, just remove it.
LibOS/shim/src/ipc/shim_ipc_helper.c, line 503 at r12 (raw file):
// We are not holding ipc_helper_lock here, so need to call put_ipc_port() // instead of __put_ipc_port();
This doesn't make sense. put_ipc_port()
is defined as a simple wrapper around __put_ipc_port()
:
void put_ipc_port (struct shim_ipc_port * port)
{
__put_ipc_port(port);
}
LibOS/shim/src/ipc/shim_ipc_helper.c, line 505 at r12 (raw file):
// instead of __put_ipc_port(); put_ipc_port(port); assert(REF_GET(port->ref_count) > 0);
Given this assert, doesn't it mean that port
will not be freed anyway? So we can move put_ipc_port(port)
back under the ipc_helper_lock
lock?
LibOS/shim/src/ipc/shim_ipc_helper.c, line 508 at r12 (raw file):
if (need_restart) restart_ipc_helper(false);
Shouldn't we grab an ipc_helper_lock
again here?
LibOS/shim/src/ipc/shim_ipc_nsimpl.h, line 755 at r7 (raw file):
Previously, donporter (Don Porter) wrote…
Got it. In this case, would it be wise to 1) add an assert that block is true if we aren't the IPC helper thread, and 2) to make sure the caller checks some sort of return value?
Or are you saying that there is code in this function that sets up the callback?
Only one function NS_CALLBACK(findns)
(in the same file) calls this func with block=false
. This caller function sets up the callback. For me, there is no need to add more asserts here.
LibOS/shim/src/ipc/shim_ipc_nsimpl.h, line 813 at r9 (raw file):
// Finally, set the IPC port as a leadership port add_ipc_port(NS_LEADER->port, 0, IPC_PORT_CLT, &ipc_leader_exit);
Why is the second argument changed from NS_LEADER->vmid
to 0
?
Test this please. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 15 of 21 files at r1, 1 of 4 files at r2, 1 of 1 files at r4, 1 of 3 files at r8, 4 of 7 files at r9, 1 of 2 files at r10, 2 of 2 files at r11, 2 of 2 files at r12, 1 of 1 files at r13.
Reviewable status: all files reviewed, 32 unresolved discussions (waiting on @donporter, @chiache, and @rainfld)
a discussion (no related file):
Previously, chiache (Chia-Che Tsai) wrote…
Thanks, that's a good point. I do plan to make another PR for SGX PAL. My observation is that the SGX PAL has less problem with host-raised exceptions because of the unique signal-handling flow of SGX.
Could you create an issue about this, so we don't forget to fix it?
LibOS/shim/include/shim_internal.h, line 455 at r10 (raw file):
Previously, donporter (Don Porter) wrote…
Or we could have points, like at the end of a system call, where we assert that preemption is reenabled.
IMO Don's method sounds better. Locks sounds like an unnecessary performance hit to me. How exactly do you want to use them here?
LibOS/shim/src/ipc/shim_ipc_helper.c, line 412 at r13 (raw file):
port->deleted = true; /* prevent further usage of the port */ wmb(); /* commit the state to the memory */
Using wmb()
suggests that port->deleted
is used from other threads also, is that true?
If so, this code is not correct - calling wmb()
doesn't make things atomic. port->deleted = true
is a non-atomic operation, moreover, it can be reordered by both the compiler and the CPU.
LibOS/shim/src/ipc/shim_ipc_helper.c, line 417 at r13 (raw file):
// some threads might be blocking for responses. lock(port->msgs_lock); if (!list_empty(port, list)) {
Why do we need this if
? Won't listp_for_each_entry_safe
exit without executing anything if the list is empty?
LibOS/shim/src/ipc/shim_ipc_helper.c, line 445 at r13 (raw file):
__get_ipc_port(port); // Bump the refcount to prevent freeing bool need_restart = __del_ipc_port(port, type); __put_ipc_port(port);
What's the purpose of bumping refcount here? If we already don't own port
then this code is incorrect (another thread can lower refcount to 0 and cause deallocation before we bump it), and if we already own it then this temporal bumping doesn't change anything.
LibOS/shim/src/ipc/shim_ipc_helper.c, line 469 at r13 (raw file):
if (!port->deleted && __del_ipc_port(port, type)) need_restart = true; __put_ipc_port(port);
ditto
LibOS/shim/src/ipc/shim_ipc_helper.c, line 505 at r13 (raw file):
// instead of __put_ipc_port(); put_ipc_port(port); assert(REF_GET(port->ref_count) > 0);
Isn't this a race-condition? We don't have any lock here, what if another thread lowers this refcount?
LibOS/shim/src/ipc/shim_ipc_helper.c, line 556 at r13 (raw file):
if (pobj->pal_handle && __del_ipc_port(pobj, type)) need_restart = true; __put_ipc_port(pobj);
ditto
LibOS/shim/src/ipc/shim_ipc_helper.c, line 1021 at r13 (raw file):
* reading the latest ipc helper state. */ barrier();
ditto: barriers don't make things atomic
LibOS/shim/src/ipc/shim_ipc_nsimpl.h, line 783 at r13 (raw file):
*/ unlock(cur_process.lock);
This is a very bad pattern. The caller doesn't know that this call (which may be issued somewhere deep in the callstack) frees the lock "for a moment", so there's a window for race conditions.
If the function is blocking then we should expect it to be called without the lock.
Pal/src/pal_defs.h, line 8 at r13 (raw file):
/* (Linux-only) enable caching loaded binaries for optimizing process creation */ #define CACHE_LOADED_BINARIES (defined(LINUX) && 0) /* default: disabled */
uhmm, how does this even compile? There's no such function as defined
and this is not an #if
.
Pal/src/host/Linux/db_exception.c, line 180 at r13 (raw file):
int pid = INLINE_SYSCALL(getpid, 0); int tid = INLINE_SYSCALL(gettid, 0); char * name = "exception";
char
-> const char
Pal/src/host/Linux/db_main.c, line 232 at r13 (raw file):
init_slab_mgr(pagesz); first_thread = malloc(HANDLE_SIZE(thread));
Please check malloc
result here.
Pal/src/host/Linux/db_main.c, line 239 at r13 (raw file):
_DkVirtualMemoryAlloc(&alt_stack, ALT_STACK_SIZE, 0, PAL_PROT_READ|PAL_PROT_WRITE); if (!alt_stack) init_fail(PAL_ERROR_NOMEM, "no alternative thread allocated");
This error message seems quite hard to understand to me. Why not just saying something like "Out of memory when allocating alternative thread stack"?
Pal/src/host/Linux/db_threading.c, line 97 at r13 (raw file):
void * child_stack = stack + THREAD_STACK_SIZE; PAL_HANDLE hdl = malloc(HANDLE_SIZE(thread));
Please check the result of malloc
Pal/src/host/Linux/pal_host.h, line 193 at r13 (raw file):
int _DkInternalLock (struct mutex_handle * lock) { for (volatile int tries = 0 ; _DkMutexLock(lock) < 0 ; tries++);
What's the purpose of tries
here? It's never read anywhere.
Pal/src/host/Linux/pal_linux_defs.h, line 10 at r13 (raw file):
#define USER_ADDRESS_LOWEST 0x10000 #define THREAD_STACK_SIZE PRESET_PAGESIZE * 2
This should be parenthesized
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 17 of 30 files reviewed, 34 unresolved discussions (waiting on @donporter, @mkow, @dimakuv, @rainfld, and @chiache)
LibOS/shim/include/shim_internal.h, line 455 at r10 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
IMO Don's method sounds better. Locks sounds like an unnecessary performance hit to me. How exactly do you want to use them here?
Good idea. This is done at the code above.
LibOS/shim/include/shim_internal.h, line 216 at r14 (raw file):
handle_signal(false); \ assert(preempt == get_cur_preempt()); \ return ret; \
Checking invariant: the preemption counter should be the same before and after the syscall.
LibOS/shim/include/shim_internal.h, line 405 at r14 (raw file):
*/ #define container_of(ptr, type, field) ((type *)((char *)(ptr) - offsetof(type, field))) #endif
This is to get rid of a warning.
LibOS/shim/src/bookkeep/shim_signal.c, line 254 at r10 (raw file):
Previously, rainfld (Li Lei) wrote…
Any insights like when tcb could potentially be NULL? I found in a few other locations, "tcb" is not checked after "SHIM_GET_TLS".
On second thought, this change is probably unnecessary. TLS is the first thing we set when initializing the SHIM, and at the time, the signal upcalls aren't even set. I have reverted this change.
LibOS/shim/src/ipc/shim_ipc_helper.c, line 388 at r12 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
Did you move the two lines above because of this condition? I.e., there was a bug with incorrectly setting
port->deleted
even if port has other usages? (Just trying to understand the code.)
Yes, indeed. The port should only be deleted when there is no other usage.
LibOS/shim/src/ipc/shim_ipc_helper.c, line 414 at r12 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
Typo "is any" -> "are any"
Done.
LibOS/shim/src/ipc/shim_ipc_helper.c, line 467 at r12 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
It would be easier to read if
!port->deleted
is moved to above IF:
if (port->info.vmid == vmid && !port->deleted) ...
Done.
LibOS/shim/src/ipc/shim_ipc_helper.c, line 503 at r12 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
This doesn't make sense.
put_ipc_port()
is defined as a simple wrapper around__put_ipc_port()
:void put_ipc_port (struct shim_ipc_port * port) { __put_ipc_port(port); }
That's true. This is probably just following the naming convention. In SHIM, we use ___XXX to specify the lock-holding version of XXX.
LibOS/shim/src/ipc/shim_ipc_helper.c, line 505 at r12 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
Given this assert, doesn't it mean that
port
will not be freed anyway? So we can moveput_ipc_port(port)
back under theipc_helper_lock
lock?
Same reference counting issues.
LibOS/shim/src/ipc/shim_ipc_helper.c, line 508 at r12 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
Shouldn't we grab an
ipc_helper_lock
again here?
Done.
LibOS/shim/src/ipc/shim_ipc_helper.c, line 417 at r13 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
Why do we need this
if
? Won'tlistp_for_each_entry_safe
exit without executing anything if the list is empty?
Done.
LibOS/shim/src/ipc/shim_ipc_helper.c, line 445 at r13 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
What's the purpose of bumping refcount here? If we already don't own
port
then this code is incorrect (another thread can lower refcount to 0 and cause deallocation before we bump it), and if we already own it then this temporal bumping doesn't change anything.
I do run to the situation that the reference counter decreases to 0 inside __del_ipc_port(). I think there is a bug with reference counting.
Unfortunately, I don't think this can be a problem easy to debug and fix, I plan to revisit this problem after this PR gets merged.
Make sense?
LibOS/shim/src/ipc/shim_ipc_helper.c, line 469 at r13 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
ditto
Same. Need to revisit reference counting issues.
LibOS/shim/src/ipc/shim_ipc_helper.c, line 505 at r13 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
Isn't this a race-condition? We don't have any lock here, what if another thread lowers this refcount?
There is a lock now.
LibOS/shim/src/ipc/shim_ipc_helper.c, line 556 at r13 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
ditto
Same reference counting issues.
LibOS/shim/src/ipc/shim_ipc_helper.c, line 1021 at r13 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
ditto: barriers don't make things atomic
Hmm, I agree this is a little hackish. The point is not to maintain atomicity. I just want to make sure the code doesn't get re-ordered by the compiler.
LibOS/shim/src/ipc/shim_ipc_nsimpl.h, line 755 at r7 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
Only one function
NS_CALLBACK(findns)
(in the same file) calls this func withblock=false
. This caller function sets up the callback. For me, there is no need to add more asserts here.
Agreed with @dimakuv.
LibOS/shim/src/ipc/shim_ipc_nsimpl.h, line 813 at r9 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
Why is the second argument changed from
NS_LEADER->vmid
to0
?
Technically, the second argument is the VMID of the other process that this port connects to. Because a leadership port is a listening server, it doesn't really have the other end.
LibOS/shim/src/ipc/shim_ipc_nsimpl.h, line 800 at r10 (raw file):
Previously, rainfld (Li Lei) wrote…
typo: become ->becomes
Done.
LibOS/shim/src/ipc/shim_ipc_nsimpl.h, line 783 at r13 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
This is a very bad pattern. The caller doesn't know that this call (which may be issued somewhere deep in the callstack) frees the lock "for a moment", so there's a window for race conditions.
If the function is blocking then we should expect it to be called without the lock.
I've rewritten this code. Please take a look.
Pal/src/pal_defs.h, line 8 at r13 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
uhmm, how does this even compile? There's no such function as
defined
and this is not an#if
.
You are right. I am surprised that GCC didn't complain.
Pal/src/host/Linux/db_exception.c, line 216 at r7 (raw file):
Previously, donporter (Don Porter) wrote…
At a minimum, I might assert that the value is 0 (or whatever value means no pending signal), so that you get a clear error if more than one pending event arrives.
My instinct would be to just pre-allocate a reasonably large array and not over-engineer things, but make sure that, if we are wrong, there is a clear failure message.
Done. I took a combined strategy eventually.
Pal/src/host/Linux/db_exception.c, line 278 at r7 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
Isn't it wrong?
_DkTerminateSighandler
performs_DkThreadExit
if there is no upcall for RESUME event (SIGCONT signal). Doesn't it mean that the thread will be exited if it didn't register SIGCONT handler (which is a clearly incorrect behavior)?
I think it makes sense to merge_DkTerminateSighandler
and_DkGenericSighandler
into one function, with a couple additional IFs and switches.
You are right. I specified the rule that PAL_EVENT_RESUME should be skipped if no upcall is assigned.
Pal/src/host/Linux/db_exception.c, line 180 at r13 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
char
->const char
Done.
Pal/src/host/Linux/db_main.c, line 232 at r13 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
Please check
malloc
result here.
Done.
Pal/src/host/Linux/db_main.c, line 239 at r13 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
This error message seems quite hard to understand to me. Why not just saying something like "Out of memory when allocating alternative thread stack"?
Done.
Pal/src/host/Linux/db_threading.c, line 68 at r10 (raw file):
Previously, chiache (Chia-Che Tsai) wrote…
Thanks! That's a very good point. Please see the latest code. The problem should be fixed now.
Done.
Pal/src/host/Linux/db_threading.c, line 97 at r13 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
Please check the result of
malloc
Done.
Pal/src/host/Linux/pal_host.h, line 191 at r7 (raw file):
Previously, donporter (Don Porter) wrote…
I see, because it is in a header and not a .c file? Seems like relocating it might be preferable.
Done.
Pal/src/host/Linux/pal_host.h, line 193 at r13 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
What's the purpose of
tries
here? It's never read anywhere.
Done. It's removed (see _DkInternalLock()
in db_mutex.c
.
Pal/src/host/Linux/pal_linux_defs.h, line 10 at r13 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
This should be parenthesized
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 17 of 30 files reviewed, 34 unresolved discussions (waiting on @donporter, @mkow, @dimakuv, @rainfld, and @chiache)
a discussion (no related file):
Previously, mkow (Michał Kowalczyk) wrote…
Could you create an issue about this, so we don't forget to fix it?
Will do so.
LibOS/shim/src/ipc/shim_ipc_helper.c, line 554 at r10 (raw file):
Previously, rainfld (Li Lei) wrote…
need to check "pobj->deleted" too? Or it won't happened at runtime?
It is checked at the code above.
LibOS/shim/src/ipc/shim_ipc_helper.c, line 497 at r12 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
This IF doesn't make sense, just remove it.
I changed this code slightly. I think this condition is still helpful here for determining whether to release the lock for calling these functions.
LibOS/shim/src/ipc/shim_ipc_helper.c, line 412 at r13 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
Using
wmb()
suggests thatport->deleted
is used from other threads also, is that true?If so, this code is not correct - calling
wmb()
doesn't make things atomic.port->deleted = true
is a non-atomic operation, moreover, it can be reordered by both the compiler and the CPU.
I think I understand what you are saying. For here, I am not really trying to enforce atomicity, but rather just coherence. Once a port is marked as deleted, we need to take a write boundary and make sure every core knows that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 12 of 21 files at r1, 1 of 4 files at r2, 3 of 7 files at r9, 1 of 2 files at r10, 12 of 13 files at r14, 1 of 1 files at r15.
Reviewable status: all files reviewed, 34 unresolved discussions (waiting on @donporter, @rainfld, @chiache, @dimakuv, and @mkow)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r13.
Reviewable status: all files reviewed, 31 unresolved discussions (waiting on @rainfld, @chiache, @dimakuv, and @mkow)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 23 unresolved discussions (waiting on @rainfld, @chiache, and @mkow)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 12 of 13 files at r14, 1 of 1 files at r15.
Reviewable status: all files reviewed, 14 unresolved discussions (waiting on @rainfld, @chiache, and @mkow)
a discussion (no related file):
Previously, chiache (Chia-Che Tsai) wrote…
Will do so.
Note: done in #292.
LibOS/shim/src/ipc/shim_ipc_helper.c, line 412 at r13 (raw file):
make sure every core knows that
Doesn't this mean that port->deleted
can be read from other threads also? If not, why do you need coherence where there are no other readers?
LibOS/shim/src/ipc/shim_ipc_helper.c, line 445 at r13 (raw file):
Previously, chiache (Chia-Che Tsai) wrote…
I do run to the situation that the reference counter decreases to 0 inside __del_ipc_port(). I think there is a bug with reference counting.
Unfortunately, I don't think this can be a problem easy to debug and fix, I plan to revisit this problem after this PR gets merged.
Make sense?
If __del_ipc_port()
would lower it to 0, then __put_ipc_port(port);
will also do that and cause deallocation. What's the difference? I mean, in both cases the object will be freed.
LibOS/shim/src/ipc/shim_ipc_helper.c, line 505 at r13 (raw file):
Previously, chiache (Chia-Che Tsai) wrote…
There is a lock now.
looks better now, but in this version we have the same bad locking pattern as with __discover_ns()
- the lock held by the caller is silently unlocked for a moment. Could you reformat it the same way as you did with __discover_ns()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 9 unresolved discussions (waiting on @chiache and @mkow)
c055393
to
ee766d0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 3 of 3 files at r28, 6 of 10 files at r29, 1 of 1 files at r30, 14 of 14 files at r31, 1 of 1 files at r32, 1 of 1 files at r33, 4 of 4 files at r34, 2 of 3 files at r35, 3 of 3 files at r37, 6 of 10 files at r38, 1 of 1 files at r39, 14 of 14 files at r40, 1 of 1 files at r41, 1 of 1 files at r42, 7 of 7 files at r43.
Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @yamahata and @chiache)
a discussion (no related file):
Previously, donporter (Don Porter) wrote…
I agree in general with smaller PRs, but I am willing to let this one slide if it makes things more stable.
You may want to take a look at #303 and see if we can reenable these tests after this patch lands (with the loop iterations set properly).
I'm also not happy with this PR being so huge, but I think we can make a one-time exception now, because this change has high priority.
a discussion (no related file):
leackage
typo (this is from a commit msg)
LibOS/shim/src/sys/shim_semget.c, line 731 at r35 (raw file):
} #if MIGRATE_SYSV_SEM == 1
In some places you write #if MIGRATE_SYSV_SEM == 1
and in others #if MIGRATE_SYSV_SEM
, could you please unify this? Or maybe even make it simpler:
#ifdef MIGRATE_SYSV_SEM
...
#endif
Pal/src/host/Linux/db_threading.c, line 65 at r26 (raw file):
Previously, chiache (Chia-Che Tsai) wrote…
Thanks for pointing this out!
+1 for creating ALIGN_UP(val, divisor)
and ALIGN_DOWN(val, divisor)
macros, more readable and less error-prone ;)
Maybe it would be better to do it in a separate PR? What do you think?
Pal/src/host/Linux/db_threading.c, line 189 at r35 (raw file):
// Free the thread stack INLINE_SYSCALL(munmap, 2, handle->thread.stack, THREAD_STACK_SIZE); // After this line, needs to exit the thread immediately
Hmm, did we just unmapped our own stack here, hoping that the generated code doesn't touch it before exit()? If so, this seems extremely fragile to me.
ee766d0
to
d2b72cc
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 34 of 35 files reviewed, 7 unresolved discussions (waiting on @mkow, @yamahata, and @chiache)
LibOS/shim/src/sys/shim_semget.c, line 731 at r35 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
In some places you write
#if MIGRATE_SYSV_SEM == 1
and in others#if MIGRATE_SYSV_SEM
, could you please unify this? Or maybe even make it simpler:#ifdef MIGRATE_SYSV_SEM ... #endif
Fixed. Sorry for the typo.
Pal/src/host/Linux/db_threading.c, line 65 at r26 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
+1 for creating
ALIGN_UP(val, divisor)
andALIGN_DOWN(val, divisor)
macros, more readable and less error-prone ;)
Maybe it would be better to do it in a separate PR? What do you think?
We can pull it from another PR of @yamahata.
Pal/src/host/Linux/db_threading.c, line 189 at r35 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
Hmm, did we just unmapped our own stack here, hoping that the generated code doesn't touch it before exit()? If so, this seems extremely fragile to me.
Added an issue in #334.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r44.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @yamahata and @chiache)
Pal/src/host/Linux/db_threading.c, line 65 at r26 (raw file):
Previously, chiache (Chia-Che Tsai) wrote…
We can pull it from another PR of @yamahata.
link?
d2b72cc
to
90748cd
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @yamahata, @chiache, and @mkow)
Pal/src/host/Linux/db_threading.c, line 65 at r26 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
link?
#299 defined ALIGN_DOWN_PTR
. We can address this issue in #320 after #299 is merged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @yamahata and @chiache)
Pal/src/host/Linux/db_threading.c, line 65 at r26 (raw file):
Previously, chiache (Chia-Che Tsai) wrote…
#299 defined
ALIGN_DOWN_PTR
. We can address this issue in #320 after #299 is merged.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! all files reviewed, all discussions resolved
a discussion (no related file):
Previously, mkow (Michał Kowalczyk) wrote…
I'm also not happy with this PR being so huge, but I think we can make a one-time exception now, because this change has high priority.
Okay, make one exception and let's move on.
Pal/src/host/Linux/db_threading.c, line 97 at r26 (raw file):
Previously, chiache (Chia-Che Tsai) wrote…
The thread stack is now freed in
_DkThreadExit()
(as the last instruction before exiting the thread).
As a new issue has been filed to track this one. So it's okay with this patch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! all files reviewed, all discussions resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! all files reviewed, all discussions resolved
…mpty(&NS_LEADER->uri)"
[Linux PAL] Keep the thread hanging when exception handling failed
90748cd
to
d31c4d7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 3 of 3 files at r46, 6 of 10 files at r47, 1 of 1 files at r48, 14 of 14 files at r49, 1 of 1 files at r50, 1 of 1 files at r51, 7 of 7 files at r52.
Reviewable status: complete! all files reviewed, all discussions resolved
This PR fixes a majority of the instability issues I've seen in the Jenkins tests:
(1) Forbidding signal delivery during thread initialization and termination:
The previous implementation does not forbid a host-raised signal (e.g., SIGINT) or a thread-resuming signal (i.e., SIGCONT) to be delivered before a thread is fully initialized or after a thread is terminated. This bug causes additional upcalls to the user-defined signal handlers even AFTER a thread has died.
For stability, this PR also temporarily disable signal handling during fork(), clone(), and execve(). Signal handling is resumed after the user program resumes execution in the new thread or the new process.
(2) IPC port deletion should wake up blocking threads:
For some IPC calls, a thread needs to block on the IPC port to wait for the response. However, if the IPC helper detects failures on the IPC port, usually being closed by the other end of the port, the IPC port is deleted without notifying these blocking threads.
(3) Rewriting the signal handling convention in PALs:
The PAL design relies on a complex signal handling convention, which needs to rewind the stack to the latest LibOS-owned frame when receiving a signal within PAL. This convention causes lots of problems, including causing deadlocks when signals arrive in the middle of critical sections. The new convention simply postpones signal delivery until the ends of PAL calls. The only caveat is that any blocking operations (such as the futex() system calls) need to check if there is any pending signal when being interrupted by host signals.
(4) (Attempt) to fix an assertion: ipc/shim_ipc_nsimpl.h:991 !qstrempty(&NS_LEADER->uri)
This is supposedly a bug for the
__discover_ns()
routine, which discovers the leadership of a namespace. This assertion mainly occurs during the Bash or GCC regression tests. I am not completely sure about the root cause of this problem, but at least this PR has made the bug disappear OR will help us debug the problem in the future. The main changes in __discover_ns() are two:__discover_ns()
does not re-grabcur_process.lock
after the IPC-based recovery has succeeded (we release the lock temporarily to prevent deadlocks).__discover_ns()
, when the current process just assumes itself as the leader, set the VMID of the IPC port where the leader listens to as 0. First of all, in this case, we don't have to know who's the other end of the IPC port. Second, we normally will check if the IPC port is connecting to another process when adding the IPC port to the listening list.__discover_ns()
returns.This change is