lazy stack: inline assembly to pre-fault stack
This is the first of eight patches that implement optional support for the so-called "lazy stack" feature. Lazy stack is explained in detail by issue #143; it allows saving a substantial amount of memory when an application spawns many pthreads with large stacks, by letting each stack grow dynamically as needed instead of being pre-populated ahead of time.

The crux of this solution, as of the previous versions, is based on the observation that the OSv memory-fault handler requires both interrupts and preemption to be enabled when a fault is triggered. Therefore, if a stack is dynamically mapped, we need to make sure that stack page faults NEVER happen in the relatively few places of kernel code that execute with either interrupts or preemption disabled. We satisfy this requirement by "pre-faulting" the stack: reading a byte one page (4096 bytes) below the stack pointer just before preemption or interrupts are disabled.
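For illustration, here is a minimal sketch of what such a pre-fault helper can look like on x64; the helper name ensure_next_stack_page is illustrative, and the exact form in this patch may differ:

    // Touch one byte a page (4096 bytes) below the current stack pointer.
    // If that page is not yet populated, this triggers the page fault now,
    // while interrupts and preemption are still enabled, so the code that
    // follows can safely disable them without risking a stack fault.
    inline void ensure_next_stack_page() {
        char byte;
        asm volatile("movb -4096(%%rsp), %0" : "=r"(byte));
    }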
Now, the problem is complicated by the fact that kernel code A that disables preemption or interrupts may nest by calling another kernel function B that also disables preemption or interrupts, in which case function B should NOT try to pre-fault the stack - otherwise the fault handler would abort due to the violated constraint stated above. In short, we cannot "blindly" or unconditionally pre-fault the stack in all places before interrupts or preemption are disabled.

Some of the previous solutions modified both arch::irq_disable() and sched::preempt_disable() to check whether both preemption and interrupts are enabled and only then read a byte at the -4096 offset. Unfortunately, this makes the scheme more costly than envisioned by Nadav Har'El: instead of a single instruction reading from memory, the compiler needs 4-5 instructions to test whether preemption and interrupts are enabled and perform the relevant jump. To make things worse, the implementation of arch::irq_enabled() is pretty expensive, at least on x64, where it uses the stack via pushfq. To avoid that, the previous solutions added a new thread-local counter packing the irq-disabling counter together with the preemption one. But even with this optimization I found the approach to degrade performance quite substantially. For example, the memory allocation logic disables preemption in quite a few places (see core/mempool.cc), and the corresponding test - misc-free-perf.cc - showed the number of malloc()/free() calls executed per second degrade on average by 15-20%.

So this latest version, implemented by this and the next 7 patches, takes a different approach. Instead of putting the conditional pre-faulting of the stack into both irq_disable() and preempt_disable(), we analyze the OSv code to find all places where irq_disable() and/or preempt_disable() is called directly (or sometimes indirectly) and pre-fault the stack there where necessary. This is obviously more laborious and prone to human error (we can miss some places), but it is much more performant (no noticeable performance degradation) compared to the earlier versions described in the paragraph above.

As we analyze all call sites, we need to make some observations to help us decide what exactly to do in each case:
- do nothing
- "blindly" pre-fault the stack (single instruction)
- conditionally pre-fault the stack (hopefully in very few places)

Rule 1: Do nothing if the call site in question ALWAYS executes in a kernel thread.
Rule 2: Do nothing if the call site executes on a populated stack - this includes the above, but also code executing on an interrupt, exception or syscall stack.
Rule 3: Do nothing if the call site executes when we know that either interrupts or preemption are already disabled. Good examples are an interrupt handler or code within WITH_LOCK(irq_lock) or WITH_LOCK(preemption_lock) blocks.
Rule 4: Pre-fault unconditionally if we know that BOTH preemption and interrupts are enabled. In most cases this can only be deduced by analyzing where the particular function is called. In general, any such function called by user code, for example through libc, satisfies the condition; but sometimes it is tricky, because the kernel itself may call libc functions such as malloc().
Rule 5: Otherwise, pre-fault the stack conditionally, determining dynamically that it is safe: only if sched::preemptable() and irq::enabled() (see the sketch at the end of this message).

One general rule is that any potential stack page fault happens on an application thread stack when some kernel code gets executed down the call stack.

In general, we identify the call sites in the following categories:
- direct calls to arch::irq_disable() and arch::irq_disable_notrace() (tracepoints)
- direct calls to sched::preempt_disable()
- code using WITH_LOCK() with an instance of irq_lock_type or irq_save_lock_type
- code using WITH_LOCK(preempt_lock)
- code using WITH_LOCK(osv::rcu_read_lock)

The above locations can be found with a simple grep, but also with an IDE like CLion from JetBrains, which can more efficiently find all direct and, more importantly, indirect usages of the call sites identified above.

So this patch lays the groundwork: it defines the inline assembly to pre-fault the stack where necessary and introduces two build parameters - CONF_lazy_stack and CONF_lazy_stack_invariant - both disabled by default. The first one guards the lazy-stack logic in all relevant places; the second one adds code asserting some related invariants that help us reason about whether we should do nothing, pre-fault the stack "blindly", or pre-fault it conditionally. The remaining 7 patches mostly add the pre-fault code in the relevant places, but also annotate code with some of those invariants using assert().
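To make Rule 5 and the build-parameter guard concrete, here is a hedged sketch of the conditional variant and of an unconditional (Rule 4) call site. The name ensure_next_stack_page_if_needed and the exact placement are illustrative, not necessarily what the later patches use; sched::preemptable(), arch::irq_enabled(), WITH_LOCK() and preempt_lock are the existing OSv constructs mentioned above:

    // Conditional variant (Rule 5): pre-fault only when it is safe to
    // take a stack fault, i.e. when both preemption and interrupts are
    // still enabled.
    inline void ensure_next_stack_page_if_needed() {
        if (sched::preemptable() && arch::irq_enabled()) {
            ensure_next_stack_page();
        }
    }

    // Unconditional call site (Rule 4), guarded by the new build
    // parameter so the logic compiles away when the feature is disabled:
    #if CONF_lazy_stack
        arch::ensure_next_stack_page();
    #endif
    WITH_LOCK(preempt_lock) {
        // ... code that must not fault on the stack ...
    }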
Signed-off-by: Waldemar Kozaczuk <[email protected]>