Conversation
This PR merges threading of object fields with marking. The result is
-14% in instructions in the CanCan backend benchmark.
Changes:
- The mark phase of the compacting GC now threads fields of marked objects.
  After marking, all fields are threaded.
- Because all fields are threaded during marking, the separate threading pass
  is gone.
- However, we still need two passes for unthreading and moving the objects
  (see below): the first pass after marking unthreads fields, then another
  pass moves the objects.
The original idea was to merge marking and threading, and to do unthreading +
moving in one pass. It turns out that doesn't work. To see why, suppose the
heap looks like this:
| ..., object A, ..., object B, ... |
where A and B point to each other. If we unthread + move (slide) in one pass,
then after sliding A, it's possible that live objects between A and B will be
slid into A's original location. So when we unthread B, which will have a
pointer to A's old location in its header, we'll read random data from
whatever has moved into A's original location.
The solution is to do these in different passes: first unthread, then slide.
The result is still 3 passes as before, but for some reason that I don't
understand, this version is 14% faster on the CanCan backend benchmark
(42,005,245 instructions down to 35,775,591).
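For reference, the two threading primitives can be sketched roughly as below,
on a toy word-addressed heap. This is only an illustration of the technique;
the helper names, the slice-based heap model, and the `is_header` predicate
are assumptions, not the RTS code.

```rust
/// Toy model: the heap is a slice of words, `obj` and `field` are word
/// indices, and `heap[obj]` is the object's header word.
type Word = usize;

/// Thread a field that points to `obj`: the object's header is replaced by a
/// link to the field, and the field temporarily stores the previous link.
/// Repeated calls build a linked list, rooted in the header, of every field
/// that references `obj`, with the original header at the end of the chain.
fn thread(heap: &mut [Word], field: usize, obj: usize) {
    let old = heap[obj];
    heap[obj] = field;
    heap[field] = old;
}

/// Unthread `obj`: walk the chain rooted in its header, write `new_loc` (the
/// object's final address) into every referencing field, then restore the
/// original header word. `is_header` tells a real header apart from a chain
/// link; in the RTS this falls out of the tag/pointer encoding.
fn unthread(heap: &mut [Word], obj: usize, new_loc: Word, is_header: impl Fn(Word) -> bool) {
    let mut link = heap[obj];
    while !is_header(link) {
        let next = heap[link];
        heap[link] = new_loc;
        link = next;
    }
    heap[obj] = link;
}
```

The important property is that, once every field has been threaded, all
references to an object can be updated with a single unthread of that object,
without any forwarding table.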
I think "An Efficient Garbage Compaction Algorithm" is describing a way to
implement in-sliding compaction in two passes instead of three, but I found
that paper difficult to follow. I will revisit it now to see if this can be
improved further.
```diff
-        heap.heap_base_offset(),
-        heap.heap_ptr_offset(),
-        heap.closure_table_ptr_offset(),
-    );
```
I removed this as it doesn't allow garbage in the heap (since we don't want garbage after a GC). I should probably modify it to take a "post GC" parameter and check for garbage depending on that.
This PR does not affect the produced WebAssembly code.
```diff
     for obj in &objs {
-        push_mark_stack(mem, *obj as usize);
+        push_mark_stack(mem, *obj as usize, obj.wrapping_sub(1));
```
Not really, I just wanted a value derived from obj for the tag argument of
push_mark_stack. Since the actual value does not matter (push/pop does not
care about the validity of tags), I more or less randomly used obj - 1.
I also explained this a little bit below. Do you have any ideas on how to
update this test in a better way?
Oh, never mind, I didn't realize this was just dummy data.
(should have looked at the context)
```diff
     for obj in objs.iter().copied().rev() {
         let popped = pop_mark_stack();
-        if popped != Some(*obj as usize) {
+        if popped != Some((obj as usize, obj.wrapping_sub(1))) {
```
```diff
     for obj in &objs {
-        push_mark_stack(mem, *obj as usize);
+        push_mark_stack(mem, *obj as usize, obj.wrapping_sub(1));
```
Given the signature (from below)

```rust
pub unsafe fn push_mark_stack<M: Memory>(mem: &mut M, obj: usize, obj_tag: Tag)
```

Aren't the arguments above the wrong way round? Or is this a different
push_mark_stack? I.e. shouldn't this call be:

```rust
push_mark_stack(mem, obj.wrapping_sub(1), *obj as usize);
```
Is Tag just an abbreviation for usize? If it is, might be nice to make them distinct types.
So this part is a bit hacky. Now that push needs a pointer + tag (u32), I needed to come up with something for the tag here. The mark stack doesn't care about the tag, so I just randomly used value - 1. I could use a constant, but I thought perhaps that would hide bugs.
> Is Tag just an abbreviation for usize? If it is, might be nice to make them distinct types.

Yeah I think making them distinct makes sense. I'll do that in a separate PR.
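For reference, a minimal version of that change could look like the sketch
below; the derives and the tag constants are illustrative, not the RTS
definitions.

```rust
/// `Tag` as a distinct type rather than a bare `u32`/`usize` alias.
/// `repr(transparent)` keeps the runtime representation identical to `u32`.
#[repr(transparent)]
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub struct Tag(pub u32);

// Illustrative values only; the real tags live in the RTS.
pub const TAG_OBJECT: Tag = Tag(1);
pub const TAG_ARRAY: Tag = Tag(3);

// A function taking a `Tag` can no longer accidentally be handed a pointer,
// a length, or some other random integer.
pub fn is_array(tag: Tag) -> bool {
    tag == TAG_ARRAY
}
```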
```diff
-pub unsafe fn push_mark_stack<M: Memory>(mem: &mut M, obj: usize) {
+pub unsafe fn push_mark_stack<M: Memory>(mem: &mut M, obj: usize, obj_tag: Tag) {
     // We add 2 words in a push, and `STACK_PTR` and `STACK_TOP` are both multiples of 2, so we can
```
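To make the "2 words per push" concrete, here is a toy model of the entry
layout; it uses a `Vec` instead of the RTS's raw `STACK_PTR`/`STACK_TOP`
pointers, but the pairing of words and the even-length invariant are the same
idea the comment above refers to.

```rust
/// Toy mark stack: entries are stored as pairs of words,
/// [obj0, tag0, obj1, tag1, ...], so the length is always even and the write
/// position advances by 2 on every push.
struct ToyMarkStack {
    words: Vec<usize>,
}

impl ToyMarkStack {
    fn push(&mut self, obj: usize, tag: usize) {
        self.words.push(obj); // first word: the object pointer
        self.words.push(tag); // second word: its (possibly dummy) tag
        debug_assert!(self.words.len() % 2 == 0);
    }

    fn pop(&mut self) -> Option<(usize, usize)> {
        let tag = self.words.pop()?;
        let obj = self.words.pop()?;
        Some((obj, tag))
    }
}
```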
Is the size of the mark stack ever an issue? If so, would it be worth compressing the 4 tag bits into something smaller than a word, or is word alignment more important?
> Is the size of the mark stack ever an issue?

I had some benchmarks on mark stack sizes when I first implemented this GC, but those benchmarks were not realistic (more like micro-benchmarks), so I should run them again on CanCan. In the worst case the stack size will be heap size / 2 (one object points to every object in the heap), but that case never happens in practice.
I was thinking about this and I think I have an idea. Previously the algorithm
was something like this at a high level:

With this PR:

I think the difference in performance is because with this PR we visit object
fields once (to push fields to the mark stack and thread), instead of twice as
before (once to push fields to the mark stack, once again to thread). This can
be seen in the diff if you search for …

As mentioned in the PR description, "An Efficient Garbage Compaction
Algorithm" describes a way to do this in two passes, but I found the paper
difficult to follow. @crusso mentioned "High-Performance Garbage Collection
for Memory-Constrained Environments" to me (thanks @crusso!), which describes
a compaction algorithm that works in two passes in section 5.1.2. I don't know
if it's the same as "An Efficient Garbage Compaction Algorithm" or not, but
it's quite simple and I think it should improve our GC even more.

Here's how it works: during marking we only thread backward pointers and skip
forward pointers. So in our original example from the PR description, during
marking we only thread the field of B that points to A, not the field of A
that points to B. The second pass is similar to before: we scan the heap from
start to end. When we see A we unthread it and update the pointers to it to
A's final location, then move A, then thread its forward pointers. Now when we
reach B we can safely unthread it, as its header will still be readable (i.e.
it won't have been overwritten by objects between A and B).

It's quite smart and should improve this PR even more. I will implement this
next week.
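To make that scheme concrete, here is a sketch of the single compaction scan,
written against a hypothetical heap interface. The trait and its methods are
assumptions made for illustration (the RTS works on raw words); `thread` and
`unthread` behave like the helpers sketched earlier, and roots as well as
self-references are assumed to have been threaded during marking along with
the backward pointers.

```rust
/// Hypothetical heap interface, just enough to express the pass.
trait CompactHeap {
    fn start(&self) -> usize;
    fn end(&self) -> usize;
    fn is_marked(&self, obj: usize) -> bool;
    fn size_of(&self, obj: usize) -> usize;
    fn fields(&self, obj: usize) -> Vec<usize>; // addresses of pointer fields
    fn read_ptr(&self, field: usize) -> usize;  // current value of a field
    fn thread(&mut self, field: usize, target: usize);
    fn unthread(&mut self, obj: usize, new_loc: usize);
    fn move_object(&mut self, from: usize, to: usize, size: usize);
}

/// Single compaction scan: by the time we reach an object, every reference to
/// it is threaded in its header (backward references during marking, forward
/// references when their earlier owners were visited below), so one unthread
/// updates them all before the object's old location can be overwritten.
fn compact<H: CompactHeap>(heap: &mut H) {
    let mut free = heap.start(); // next destination address
    let mut p = heap.start();
    while p < heap.end() {
        if heap.is_marked(p) {
            // Restore the header and point every reference at the final address.
            heap.unthread(p, free);
            let size = heap.size_of(p);
            // Slide the object down over the garbage.
            heap.move_object(p, free, size);
            // Thread the fields that still point forward (to not-yet-visited
            // objects) so their targets can update them when the scan arrives.
            for field in heap.fields(free) {
                let target = heap.read_ptr(field);
                if target > p {
                    heap.thread(field, target);
                }
            }
            free += size;
            p += size;
        } else {
            p += heap.size_of(p); // skip garbage
        }
    }
}
```

Fields that pointed backward already hold their targets' final (lower)
addresses by the time the object moves, so the `target > p` test selects
exactly the fields that still need threading.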
Actually, I'm not sure about this, because we will have to visit object fields twice and the -14% improvement will be lost. I will implement it in a new PR so that we can compare.
```diff
 }

-pub unsafe fn pop_mark_stack() -> Option<usize> {
+pub unsafe fn pop_mark_stack() -> Option<(usize, Tag)> {
```
I guess you could return a sentinel, not an Option, for perf, but perhaps it's not worth the effort.
Maybe Rust even does that automatically? It does for some types, but the list doesn't mention tuples: https://doc.rust-lang.org/std/option/#representation
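A quick standalone check (nothing RTS-specific): `usize` has no niche, so
neither `Option<usize>` nor the new tuple payload gets the discriminant folded
away, while the `NonZero` types do.

```rust
use std::mem::size_of;
use std::num::NonZeroUsize;

fn main() {
    // No niche in `usize`: the discriminant needs extra space.
    println!("Option<usize>:        {} bytes", size_of::<Option<usize>>());
    println!("Option<(usize, u32)>: {} bytes", size_of::<Option<(usize, u32)>>());
    // Guaranteed by the docs linked above: same size as the payload.
    assert_eq!(size_of::<Option<NonZeroUsize>>(), size_of::<usize>());
}
```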
The last commit removes a redundant pass over the roots, so it should improve perf even more.

The benchmark reports -2.7% in instructions.
After all the optimizations, the current status is -17.15% in instructions compared to master.
This PR implements the mark-compact GC algorithm briefly described in "High-Performance Garbage Collection for Memory-Constrained Environments", section 5.1.2 (I think the original paper is "An Efficient Garbage Compaction Algorithm", but that paper is a bit difficult to follow).

The idea is as follows: when marking an object, we thread its backward pointers. Afterwards we linearly scan the heap as before. For a live object, we unthread it and update the backward pointers to it to the object's new location, move the object, and then thread its forward pointers. After this pass all objects are moved and all references are updated. This reduces the number of passes in the original mark-compact collector to 2.

The CanCan backend benchmark reports -19.8% in instructions. Compared to the copying GC, the mark-compact GC is now 36% slower, instead of 70% as before.

This PR also optimizes `mark_static_roots` by inlining `mark_fields` in the body and specializing it for `MutBox`: the static root array only points to static `MutBox`es, so there is no need to call the more general `mark_fields`. This optimization gives us approximately -2% in instructions.

See also #2641 for another, slower variant of the algorithm.
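The `mark_static_roots` specialization can be pictured roughly as follows. The
`MutBox` layout, the root-array representation, and the `mark_and_thread_field`
callback are assumptions for illustration, not the actual RTS definitions.

```rust
/// Assumed layout of a static MutBox: a header word plus one pointer field.
#[repr(C)]
struct MutBox {
    header: usize,
    field: usize,
}

/// Walk the static root array directly. Every element is known to be a static
/// `MutBox`, so instead of dispatching through the generic `mark_fields` we
/// mark/thread the single field each MutBox has.
unsafe fn mark_static_roots_specialized(
    roots: *const *mut MutBox,
    n_roots: usize,
    mut mark_and_thread_field: impl FnMut(*mut usize),
) {
    for i in 0..n_roots {
        let mutbox: *mut MutBox = *roots.add(i);
        // The MutBox itself is static and never moves; only the object its
        // field points to needs marking, and the field needs threading so it
        // is updated when that object moves.
        mark_and_thread_field(core::ptr::addr_of_mut!((*mutbox).field));
    }
}
```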