[mono][interp] Reduce false pinning from interp stack #100400
Conversation
Force-pushed from 697faf5 to fa185fc.
int slot_index = 0;
for (gpointer *p = (gpointer*)context->stack_start; p < (gpointer*)context->stack_pointer; p++) {
	if (context->no_ref_slots && (context->no_ref_slots [slot_index / 8] & (1 << (slot_index % 8))))
		; // This slot is marked as no ref, we don't scan it
	else
		func (p, gc_data);
	slot_index++;
Just summarizing so I understand how this all works.
The way this works is that sgen calls interp_mark_stack twice: once during the conservative phase and once during the precise phase (if the precise phase is enabled). This code runs during the conservative phase. The no_ref_slots bits are conservative: either we're sure there's definitely no managed pointer in the slot, or we're not sure what's in the slot. This is because we sometimes push managed pointers with MONO_TYPE_I.
So if we're sure the slot isn't a pointer, we don't scan it at all. Otherwise, if we're not sure, we scan it and possibly create false pinning.
So it's not important that this PR precisely tracks every single managed pointer opcode. But whenever we see a slot that can't possibly contain a managed pointer, we can mark it and potentially avoid some false pinning.
/cc @cshung FYI
I would add for clarity that the ambiguity of MONO_TYPE_I is not really that relevant to the pinning story. Say we have two vars, one int32 and one of type object, that are allocated at the same offset because their liveness doesn't intersect. Since we mark these ref bits for the whole scope of the method, the offset will be marked as potentially containing a ref. At the moment of suspend, in theory, we could tell exactly whether the int32 var or the object var is alive at that point of execution, or maybe neither of them. This would probably eliminate most of the remaining false pinning.
src/mono/mono/mini/interp/interp.c
Outdated
@@ -412,6 +412,8 @@ get_context (void)
if (context == NULL) {
	context = g_new0 (ThreadContext, 1);
	context->stack_start = (guchar*)mono_valloc_aligned (INTERP_STACK_SIZE, MINT_STACK_ALIGNMENT, MONO_MMAP_READ | MONO_MMAP_WRITE, MONO_MEM_ACCOUNT_INTERP_STACK);
	// A bit for every pointer sized slot in the stack. FIXME don't allocate whole bit array
	context->no_ref_slots = (guchar*)mono_valloc (NULL, INTERP_STACK_SIZE / (8 * sizeof (gpointer)), MONO_MMAP_READ | MONO_MMAP_WRITE, MONO_MEM_ACCOUNT_INTERP_STACK);
Is INTERP_STACK_SIZE the max stack size?
Yes. It is 1MB, statically allocated and never changed or increased.
gsize global_slot_index = current - (gpointer*)context->stack_start;
gsize table_index = global_slot_index / 8;
int bit_index = global_slot_index % 8;
context->no_ref_slots [table_index] |= 1 << bit_index;
Can we use mono_bitset_set_fast here?
We could, but the use case was simple enough that I went with the manual implementation.
guint32 old_size = td->ref_slots ? (guint32)td->ref_slots->size : 0;
guint32 new_size = old_size ? old_size * 2 : 32;

gpointer mem = mono_mempool_alloc0 (td->mempool, mono_bitset_alloc_size (new_size, 0));
How do you decide whether to use mono_mempool_alloc0 or mono_mem_manager_alloc0?
td->mempool is a mempool used by the interpreter compiler for any temporary data needed during compilation. The mempool is destroyed once the method finishes compiling. mono_mem_manager_alloc0 is used for memory needed during execution, so it remains alive for the entire lifetime of the application.
Interpreter opcodes operate on the interp stack, an area of memory that is separately allocated. Each interp var has an allocated stack offset in the current interpreter stack frame. When we allocate the storage for an interp var we can take the var type into account: if the type can represent a potential ref to an object or an interior ref, we mark the pointer slot as potentially containing refs, for the method that is being compiled.

During GC, we used to conservatively scan the entire interp stack space used by each thread. After this change, in the first stage we do a stack walk where we detect slots in each interp frame where no refs can reside. We mark these slots in a bit array. Afterwards we conservatively scan the interp stack of the thread, while ignoring slots that were previously marked as not containing any refs.

The System.Runtime.Tests suite was used to measure the effectiveness of the change, by computing the cumulative number of pinned objects throughout all GCs (about 1100):

minijit - avg 702000 pinned objects
old-interp - avg 641000 pinned objects
precise-interp - avg 578000 pinned objects

This resulted in a 10% reduction in the number of pinned objects during collection. This change is meant to reduce memory usage of apps by making objects die earlier. We could further improve by being more precise. For example, for call sites we could reuse liveness information to know precisely which slots actually contain refs. This is a bit more complex to implement and it is unclear yet how impactful it would be.
A lot of the time, when pushing a byref type on the stack during compilation, we would first get the mint_type, which would be MINT_TYPE_I4/I8. From the mint_type we would then obtain STACK_TYPE_I4/I8, losing information, because it should have been STACK_TYPE_MP. Because of this, the underlying interp var would end up being created as MONO_TYPE_I4/I8 instead of MONO_TYPE_I. Add another method for pushing a MonoType directly, with less confusing indirections. Code around here could be refactored further. This is only relevant for GC stack scanning, since we want to scan only slots containing MONO_TYPE_I.
Force-pushed from fa42813 to 7efd380.
* [mono][interp] Reduce false pinning from interp stack
* [mono][interp] Add option to disable precise scanning of stack
* [mono][interp] Fix pushing of byrefs on execution stack