Compacting GC: merge threading and marking #2641

Closed
osa1 wants to merge 7 commits into master from compacting_gc_4_rebase

Conversation


@osa1 osa1 commented Jul 1, 2021

This PR merges threading of object fields with marking. The result is
-14% in instructions in the CanCan backend benchmark.

Changes:

  • The mark phase of the compacting GC now threads the fields of marked
    objects, so after marking, all fields are threaded.

  • Because all fields are threaded during marking, the separate threading
    pass is now gone.

  • However, we still need two passes for unthreading and moving the
    objects (see below): the first pass after marking unthreads fields,
    and another pass moves the objects.

The original idea was to merge marking and threading, and to do unthreading +
moving in a single pass. It turns out that doesn't work. To see why, suppose
the heap looks like this:

| ..., object A, ..., object B, ... |

where A and B point to each other. If we unthread + move (slide) in one pass,
then after sliding A, live objects between A and B may be slid over A's
original location. So when we unthread B, which has a pointer to A's old
location in its header, we read random data from whatever was moved over A's
original location.

The solution is to do these in different passes: first unthread, then slide.

The result is still 3 passes as before, but for some reason that I don't
understand, this version is 14% faster on the CanCan backend benchmark
(42,005,245 instructions down to 35,775,591).

I think "An Efficient Garbage Compaction Algorithm" is describing a way to
implement in-sliding compaction in two passes instead of three, but I found
that paper difficult to follow. I will revisit it now to see if this can be
improved further.
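
(For context, here is a minimal sketch of the threading/unthreading primitives the passes above are built on, in the style of Jonkers' algorithm. The word-level object model and the `is_field_address` helper are simplifying assumptions for illustration, not the actual RTS code.)

    type Word = usize;

    /// Thread a field that points at some object: the object's header word
    /// becomes the head of a chain running through all fields that point at
    /// it, with the original header kept at the end of the chain.
    unsafe fn thread(field: *mut Word) {
        let obj = *field as *mut Word;   // the object this field points to
        let old_head = *obj;             // current head of the chain (or the tag)
        *obj = field as Word;            // header now points at this field
        *field = old_head;               // field continues the chain
    }

    /// Unthread an object: walk the chain hanging off its header, make every
    /// field in it point at `new_loc`, and restore the saved header word.
    unsafe fn unthread(obj: *mut Word, new_loc: Word) {
        let mut link = *obj;
        while is_field_address(link) {   // chain ends at the original header
            let field = link as *mut Word;
            link = *field;               // next element of the chain
            *field = new_loc;            // field now refers to the new location
        }
        *obj = link;                     // restore the original header (tag)
    }

    // Simplifying assumption: thread links (field addresses) can be told
    // apart from header tags by their value range.
    unsafe fn is_field_address(w: Word) -> bool {
        const HEAP_BASE: Word = 0x1_0000; // hypothetical
        w >= HEAP_BASE
    }

The crucial property is that unthreading an object needs its original header word to still be in place, which is why sliding objects over A's old location before B is unthreaded breaks.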

@osa1 osa1 requested a review from crusso July 1, 2021 15:13
heap.heap_base_offset(),
heap.heap_ptr_offset(),
heap.closure_table_ptr_offset(),
);
Contributor Author

I removed this because it doesn't allow garbage in the heap (we don't want garbage after a GC). I should probably modify it to take a "post GC" parameter and check for garbage depending on that.

@dfinity-ci

This PR does not affect the produced WebAssembly code.


for obj in &objs {
push_mark_stack(mem, *obj as usize);
push_mark_stack(mem, *obj as usize, obj.wrapping_sub(1));
Contributor

is that an unskew?

Contributor Author
@osa1 osa1 Jul 2, 2021

Not really; I just wanted a value derived from obj for the tag argument of push_mark_stack. Since the actual value does not matter (push/pop does not care about the validity of tags), I more or less arbitrarily used obj - 1.

I also explained this a bit below. Do you have any ideas on how to update this test better?

Contributor

Oh, never mind, I didn't realize this was just dummy data.

Contributor

(should have looked at the context)

for obj in objs.iter().copied().rev() {
let popped = pop_mark_stack();
if popped != Some(*obj as usize) {
if popped != Some((obj as usize, obj.wrapping_sub(1))) {
Contributor

ditto


for obj in &objs {
push_mark_stack(mem, *obj as usize);
push_mark_stack(mem, *obj as usize, obj.wrapping_sub(1));
Contributor

Given the signature (from below)

pub unsafe fn push_mark_stack<M: Memory>(mem: &mut M, obj: usize, obj_tag: Tag) 

Aren't the arguments above the wrong way round? Or is this a different push_mark_stack?

I.e. shouldn't this call be:

push_mark_stack(mem, obj.wrapping_sub(1), *obj as usize);

Contributor

Is Tag just an abbreviation for usize? If it is, might be nice to make them distinct types.

Contributor Author

So this part is a bit hacky. Now that a push needs a pointer + tag (u32), I needed to come up with something for the tag here. The mark stack doesn't care about the tag, so I just arbitrarily used value - 1. I could use a constant, but I thought that might hide bugs.

Is Tag just an abbreviation for usize? If it is, might be nice to make them distinct types.

Yeah I think making them distinct makes sense. I'll do that in a separate PR.


pub unsafe fn push_mark_stack<M: Memory>(mem: &mut M, obj: usize) {
pub unsafe fn push_mark_stack<M: Memory>(mem: &mut M, obj: usize, obj_tag: Tag) {
// We add 2 words in a push, and `STACK_PTR` and `STACK_TOP` are both multiples of 2, so we can
Contributor

Is the size of the mark stack ever an issue? If so, would it be worth compressing the 4 tag bits into something smaller than a word, or is word alignment more important?

Contributor Author

Is the size of the mark stack ever an issue?

I had some benchmarks on mark stack sizes when I first implemented this GC, but those benchmarks were not realistic (more like micro-benchmarks), so I should run them again on CanCan. In the worst case the stack size will be heap size / 2 (one object points to every object in the heap), but that case never happens in practice.
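
(For reference, a stand-alone model of the two-word entries being discussed, assuming `Tag` is currently just an alias for a 32-bit word; the real mark stack is a bump-allocated blob grown through `Memory`, not a `Vec`.)

    type Tag = u32; // assumption: Tag is a bare 32-bit word for now

    struct MarkStack {
        words: Vec<usize>, // each entry occupies two words: (obj, tag)
    }

    impl MarkStack {
        fn push(&mut self, obj: usize, obj_tag: Tag) {
            self.words.push(obj);
            self.words.push(obj_tag as usize);
        }

        fn pop(&mut self) -> Option<(usize, Tag)> {
            let tag = self.words.pop()? as Tag;
            let obj = self.words.pop().expect("entries are always two words");
            Some((obj, tag))
        }
    }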


osa1 commented Jul 2, 2021

The result is still 3 passes as before, but for some reason that I don't understand, this version is 14% faster

I was thinking about this and I think I have an idea. Previously the algorithm was something like this at a high level:

  • Mark live objects
  • Visit live objects from start to end, unthread and update forward pointers to the object's new location, thread fields
  • Visit live objects from start to end, unthread and update backward pointers to the object's new location, move the object

With this PR:

  • Mark live objects, thread fields
  • Visit live objects from start to end, unthread objects and update both forward and backward pointers to the object's new location
  • Visit live objects from start to end, move objects

I think the difference in performance is because with this PR we visit object fields once (to push them to the mark stack and thread them), instead of twice as before (once to push them to the mark stack, and once again to thread them). This can be seen in the diff by searching for thread_obj_fields: the use in update_fwd_refs is now gone.
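
(A rough sketch of the merged field visit, with hypothetical stubs standing in for the RTS primitives; the point is only that each field is read, pushed, and threaded in a single traversal.)

    unsafe fn visit_fields(obj: *mut usize, n_fields: usize) {
        for i in 0..n_fields {
            let field = obj.add(1 + i);            // fields follow a 1-word header
            let pointee = *field;
            if is_pointer(pointee) {
                if !is_marked(pointee) {
                    set_mark_bit(pointee);
                    // Save the tag on the mark stack now: the header may
                    // later be overwritten by a thread link.
                    push_mark_stack(pointee, *(pointee as *const usize) as u32);
                }
                // Thread the field right away instead of in a separate pass.
                thread(field);
            }
        }
    }

    // Stubs standing in for RTS primitives (assumptions for this sketch only).
    unsafe fn is_pointer(_w: usize) -> bool { unimplemented!() }
    unsafe fn is_marked(_p: usize) -> bool { unimplemented!() }
    unsafe fn set_mark_bit(_p: usize) { unimplemented!() }
    unsafe fn push_mark_stack(_obj: usize, _tag: u32) { unimplemented!() }
    unsafe fn thread(_field: *mut usize) { unimplemented!() }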


As mentioned in the PR description, "An Efficient Garbage Compaction Algorithm" describes a way to do this in two passes, but I found the paper difficult to follow. @crusso mentioned "High-Performance Garbage Collection for Memory-Constrained Environments" to me (thanks @crusso!), which describes a compaction algorithm that works in two passes in section 5.1.2. I don't know whether it's the same as "An Efficient Garbage Compaction Algorithm" or not, but it's quite simple and I think it should improve our GC even more.

Here's how it works: during marking we only thread backward pointers and skip forward pointers. So in our original example from the PR description:

| ..., object A, ..., object B, ... |

during marking we only thread the field of B that points to A, not the field of A that points to B.

The second pass is similar: we scan the heap from start to end. When we see A we unthread it (updating the threaded pointers to A's final location), then move A, then thread its forward pointers. When we later reach B we can safely unthread it, as its header is still readable (i.e. it won't have been overwritten by objects between A and B).

It's quite smart and should improve this PR even more. I will implement this next week.
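
(A high-level sketch of that single compaction pass, with hypothetical helpers and sizes in bytes; the marking pass is as before, except that only backward-pointing fields are threaded.)

    unsafe fn compact(heap_start: usize, heap_end: usize) {
        let mut free = heap_start; // next destination address
        let mut scan = heap_start;
        while scan < heap_end {
            // Read the size before the copy below can clobber this region.
            let size = object_size(scan);
            if is_marked(scan) {
                let new_loc = free;
                // 1. Everything threaded through this object's header
                //    (backward pointers, plus fields of already-moved
                //    objects) is redirected to `new_loc`.
                unthread(scan, new_loc);
                // 2. Slide the object down; its header is intact, so the
                //    copy can still read it.
                copy_bytes(new_loc, scan, size);
                // 3. Thread the moved object's forward-pointing fields, so
                //    that step 1 fixes them once `scan` reaches their targets.
                thread_forward_fields(new_loc);
                free += size;
            }
            scan += size;
        }
    }

    // Stubs standing in for RTS primitives (assumptions for this sketch only).
    unsafe fn is_marked(_obj: usize) -> bool { unimplemented!() }
    unsafe fn object_size(_obj: usize) -> usize { unimplemented!() }
    unsafe fn unthread(_obj: usize, _new_loc: usize) { unimplemented!() }
    unsafe fn copy_bytes(_to: usize, _from: usize, _n: usize) { unimplemented!() }
    unsafe fn thread_forward_fields(_obj: usize) { unimplemented!() }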


osa1 commented Jul 2, 2021

should improve this PR even more

Actually I'm not sure about this, because we will have to visit object fields twice and the -14% will be lost. I will implement it in a new PR so that we can compare.

}

pub unsafe fn pop_mark_stack() -> Option<usize> {
pub unsafe fn pop_mark_stack() -> Option<(usize, Tag)> {
Contributor

I guess you could return a sentinel, not an Option, for perf, but perhaps it's not worth the effort.

Contributor

Maybe Rust even does that automatically? It does for some types, but the list doesn't mention tuples: https://doc.rust-lang.org/std/option/#representation
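
(A quick way to check is to compare sizes. `usize` and plain tuples of it have no niche, so `Option<(usize, u32)>` is expected to need extra space, while a `NonZeroUsize` in the first slot typically lets the compiler reuse 0 as the `None` value; `Tag = u32` is an assumption here.)

    use std::mem::size_of;
    use std::num::NonZeroUsize;

    fn main() {
        println!("(usize, u32):                {}", size_of::<(usize, u32)>());
        println!("Option<(usize, u32)>:        {}", size_of::<Option<(usize, u32)>>());
        println!("Option<(NonZeroUsize, u32)>: {}", size_of::<Option<(NonZeroUsize, u32)>>());
    }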



osa1 commented Jul 5, 2021

The last commit removes a redundant pass over the roots, so it should improve perf a bit more. I'll rerun the benchmark. The bench now reports -14.8%, so less than 1% better.

Bench reports -2.7% in instructions

osa1 commented Jul 5, 2021

After all the optimizations, the current status is -17.15% in instructions compared to master.


osa1 commented Jul 5, 2021

#2647 performs better than this one, closing.

The remaining comments above are addressed in #2648.

@osa1 osa1 closed this Jul 5, 2021
mergify bot pushed a commit that referenced this pull request Jul 5, 2021
This PR implements the mark-compact GC algorithm briefly described in
"High-Performance Garbage Collection for Memory-Constrained Environments",
section 5.1.2 (I think the original paper is "An Efficient Garbage Compaction
Algorithm", but that paper is a bit difficult to follow).

The idea is as follows: when marking an object, we thread the backward pointers
of that object. Afterwards we linearly scan the heap as before. For a live
object, we unthread it (updating the backward pointers to it to the object's
new location), move the object, and then thread its forward pointers. After
this pass all objects are moved and all references are updated.

This reduces the number of passes of the original mark-compact collector to 2.

The CanCan backend benchmark reports -19.8% in instructions. Compared to the
copying GC, the mark-compact GC is now 36% slower, instead of 70% slower as
before.

This PR also optimizes `mark_static_roots` by inlining `mark_fields` into its
body and specializing it for `MutBox`: the static root array only points to
static `MutBox`es, so there is no need to call the more general `mark_fields`.
This optimization gives us approximately -2% in instructions.

See also #2641 for another, slower variant of the algorithm.
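
(A rough sketch of the `mark_static_roots` specialization mentioned in the commit message above; the `MutBox` layout and helpers here are simplified stand-ins, not the actual RTS definitions.)

    #[repr(C)]
    struct MutBox {
        header: usize, // tag word
        field: usize,  // the single pointer into the dynamic heap
    }

    unsafe fn mark_static_roots(roots: *const *mut MutBox, n_roots: usize) {
        for i in 0..n_roots {
            let mutbox = *roots.add(i);
            // Static MutBoxes are not moved themselves, so there is no need
            // for the general, tag-dispatching mark_fields; only the one
            // field that points into the dynamic heap is marked and threaded.
            mark_and_thread(std::ptr::addr_of_mut!((*mutbox).field));
        }
    }

    // Stub standing in for the combined mark + thread step from this PR.
    unsafe fn mark_and_thread(_field: *mut usize) { unimplemented!() }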