Conversation
This PR merges threading of object fields with marking. The result is
-14% in instructions in the CanCan backend benchmark.
Changes:
- The mark phase of the compacting GC now threads fields of marked objects.
  After marking, all fields are threaded.
- Because all fields are threaded during marking, the separate threading pass
  is gone.
- However, we still need two passes for unthreading and moving the objects
  (see below): the first pass after marking unthreads fields, then another
  pass moves the objects.
The original idea was to merge marking and threading, and to do unthreading +
moving in one pass. It turns out that doesn't work. To see why, suppose the
heap looks like this:
| ..., object A, ..., object B, ... |
where A and B point to each other. If we unthread + move (slide) in one pass,
then after sliding A, it's possible that live objects between A and B will be
slid into A's original location. So when we unthread B, which will have a
pointer to A's old location in its header, we'll read random data from
whatever has moved into A's original location.
The solution is to do these in different passes: first unthread, then slide.
The result is still 3 passes as before, but for some reason that I don't
understand, this version is 14% faster on the CanCan backend benchmark
(42,005,245 instructions down to 35,775,591).
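For reference, the two threading primitives can be sketched roughly as below,
on a toy word-addressed heap. This is only an illustration of the technique;
the helper names, the slice-based heap model, and the `is_header` predicate
are assumptions, not the RTS code.

```rust
/// Toy model: the heap is a slice of words, `obj` and `field` are word
/// indices, and `heap[obj]` is the object's header word.
type Word = usize;

/// Thread a field that points to `obj`: the object's header is replaced by a
/// link to the field, and the field temporarily stores the previous link.
/// Repeated calls build a linked list, rooted in the header, of every field
/// that references `obj`, with the original header at the end of the chain.
fn thread(heap: &mut [Word], field: usize, obj: usize) {
    let old = heap[obj];
    heap[obj] = field;
    heap[field] = old;
}

/// Unthread `obj`: walk the chain rooted in its header, write `new_loc` (the
/// object's final address) into every referencing field, then restore the
/// original header word. `is_header` tells a real header apart from a chain
/// link; in the RTS this falls out of the tag/pointer encoding.
fn unthread(heap: &mut [Word], obj: usize, new_loc: Word, is_header: impl Fn(Word) -> bool) {
    let mut link = heap[obj];
    while !is_header(link) {
        let next = heap[link];
        heap[link] = new_loc;
        link = next;
    }
    heap[obj] = link;
}
```

The important property is that, once every field has been threaded, all
references to an object can be updated with a single unthread of that object,
without any forwarding table.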
I think "An Efficient Garbage Compaction Algorithm" is describing a way to
implement in-sliding compaction in two passes instead of three, but I found
that paper difficult to follow. I will revisit it now to see if this can be
improved further.
```diff
-        heap.heap_base_offset(),
-        heap.heap_ptr_offset(),
-        heap.closure_table_ptr_offset(),
-    );
```
I removed this as it doesn't allow garbage in the heap (since we don't want garbage after a GC). I should probably modify it to take a "post GC" parameter and check for garbage depending on that.
This PR does not affect the produced WebAssembly code.
```diff
     for obj in &objs {
-        push_mark_stack(mem, *obj as usize);
+        push_mark_stack(mem, *obj as usize, obj.wrapping_sub(1));
```
Not really, I just wanted a value derived from obj for the tag argument of
push_mark_stack. Since the actual value does not matter (push/pop does not
care about the validity of tags), I more or less randomly used obj - 1.
I also explained this a little bit below. Do you have any ideas on how to
update this test in a better way?
Oh, never mind, I didn't realize this was just dummy data.
(should have looked at the context)
```diff
     for obj in objs.iter().copied().rev() {
         let popped = pop_mark_stack();
-        if popped != Some(*obj as usize) {
+        if popped != Some((obj as usize, obj.wrapping_sub(1))) {
```
```diff
     for obj in &objs {
-        push_mark_stack(mem, *obj as usize);
+        push_mark_stack(mem, *obj as usize, obj.wrapping_sub(1));
```
Given the signature (from below)

```rust
pub unsafe fn push_mark_stack<M: Memory>(mem: &mut M, obj: usize, obj_tag: Tag)
```

Aren't the arguments above the wrong way round? Or is this a different
push_mark_stack? I.e. shouldn't this call be:

```rust
push_mark_stack(mem, obj.wrapping_sub(1), *obj as usize);
```
Is Tag just an abbreviation for usize? If it is, might be nice to make them distinct types.
So this part is a bit hacky. Now that push needs a pointer + tag (u32), I needed to come up with something for the tag here. The mark stack doesn't care about the tag, so I just randomly used value - 1. I could use a constant, but I thought perhaps that would hide bugs.
> Is Tag just an abbreviation for usize? If it is, might be nice to make them distinct types.

Yeah I think making them distinct makes sense. I'll do that in a separate PR.
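For reference, a minimal version of that change could look like the sketch
below; the derives and the tag constants are illustrative, not the RTS
definitions.

```rust
/// `Tag` as a distinct type rather than a bare `u32`/`usize` alias.
/// `repr(transparent)` keeps the runtime representation identical to `u32`.
#[repr(transparent)]
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub struct Tag(pub u32);

// Illustrative values only; the real tags live in the RTS.
pub const TAG_OBJECT: Tag = Tag(1);
pub const TAG_ARRAY: Tag = Tag(3);

// A function taking a `Tag` can no longer accidentally be handed a pointer,
// a length, or some other random integer.
pub fn is_array(tag: Tag) -> bool {
    tag == TAG_ARRAY
}
```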
```diff
-pub unsafe fn push_mark_stack<M: Memory>(mem: &mut M, obj: usize) {
+pub unsafe fn push_mark_stack<M: Memory>(mem: &mut M, obj: usize, obj_tag: Tag) {
     // We add 2 words in a push, and `STACK_PTR` and `STACK_TOP` are both multiples of 2, so we can
```
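To make the "2 words per push" concrete, here is a toy model of the entry
layout; it uses a `Vec` instead of the RTS's raw `STACK_PTR`/`STACK_TOP`
pointers, but the pairing of words and the even-length invariant are the same
idea the comment above refers to.

```rust
/// Toy mark stack: entries are stored as pairs of words,
/// [obj0, tag0, obj1, tag1, ...], so the length is always even and the write
/// position advances by 2 on every push.
struct ToyMarkStack {
    words: Vec<usize>,
}

impl ToyMarkStack {
    fn push(&mut self, obj: usize, tag: usize) {
        self.words.push(obj); // first word: the object pointer
        self.words.push(tag); // second word: its (possibly dummy) tag
        debug_assert!(self.words.len() % 2 == 0);
    }

    fn pop(&mut self) -> Option<(usize, usize)> {
        let tag = self.words.pop()?;
        let obj = self.words.pop()?;
        Some((obj, tag))
    }
}
```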
Is the size of the mark stack ever an issue? If so, would it be worth compressing the 4 tag bits into something smaller than a word, or is word alignment more important?
> Is the size of the mark stack ever an issue?

I had some benchmarks on mark stack sizes when I first implemented this GC, but those benchmarks were not realistic (more like micro-benchmarks), so I should run them again on CanCan. In the worst case the stack size will be heap size / 2 (one object points to every object in the heap), but that case never happens in practice.
I was thinking about this and I think I have an idea. Previously the algorithm
was something like this at a high level:

With this PR:

I think the difference in performance is because with this PR we visit object
fields once (to push fields to the mark stack and thread), instead of twice as
before (once to push fields to the mark stack, once again to thread). This can
be seen in the diff if you search for …

As mentioned in the PR description, "An Efficient Garbage Compaction
Algorithm" describes a way to do this in two passes, but I found the paper
difficult to follow. @crusso mentioned "High-Performance Garbage Collection
for Memory-Constrained Environments" to me (thanks @crusso!), which describes
a compaction algorithm that works in two passes in section 5.1.2. I don't know
if it's the same as "An Efficient Garbage Compaction Algorithm" or not, but
it's quite simple and I think it should improve our GC even more.

Here's how it works: during marking we only thread backward pointers and skip
forward pointers. So in our original example from the PR description, during
marking we only thread the field of B that points to A, not the field of A
that points to B. The second pass is similar to before: we scan the heap from
start to end. When we see A we unthread it and update the pointers to it to
A's final location, then move A, then thread its forward pointers. Now when we
reach B we can safely unthread it, as its header will still be readable (i.e.
it won't have been overwritten by objects between A and B).

It's quite smart and should improve this PR even more. I will implement this
next week.
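To make that scheme concrete, here is a sketch of the single compaction scan,
written against a hypothetical heap interface. The trait and its methods are
assumptions made for illustration (the RTS works on raw words); `thread` and
`unthread` behave like the helpers sketched earlier, and roots as well as
self-references are assumed to have been threaded during marking along with
the backward pointers.

```rust
/// Hypothetical heap interface, just enough to express the pass.
trait CompactHeap {
    fn start(&self) -> usize;
    fn end(&self) -> usize;
    fn is_marked(&self, obj: usize) -> bool;
    fn size_of(&self, obj: usize) -> usize;
    fn fields(&self, obj: usize) -> Vec<usize>; // addresses of pointer fields
    fn read_ptr(&self, field: usize) -> usize;  // current value of a field
    fn thread(&mut self, field: usize, target: usize);
    fn unthread(&mut self, obj: usize, new_loc: usize);
    fn move_object(&mut self, from: usize, to: usize, size: usize);
}

/// Single compaction scan: by the time we reach an object, every reference to
/// it is threaded in its header (backward references during marking, forward
/// references when their earlier owners were visited below), so one unthread
/// updates them all before the object's old location can be overwritten.
fn compact<H: CompactHeap>(heap: &mut H) {
    let mut free = heap.start(); // next destination address
    let mut p = heap.start();
    while p < heap.end() {
        if heap.is_marked(p) {
            // Restore the header and point every reference at the final address.
            heap.unthread(p, free);
            let size = heap.size_of(p);
            // Slide the object down over the garbage.
            heap.move_object(p, free, size);
            // Thread the fields that still point forward (to not-yet-visited
            // objects) so their targets can update them when the scan arrives.
            for field in heap.fields(free) {
                let target = heap.read_ptr(field);
                if target > p {
                    heap.thread(field, target);
                }
            }
            free += size;
            p += size;
        } else {
            p += heap.size_of(p); // skip garbage
        }
    }
}
```

Fields that pointed backward already hold their targets' final (lower)
addresses by the time the object moves, so the `target > p` test selects
exactly the fields that still need threading.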
Actually, I'm not sure about this, because we will have to visit object fields twice and the -14% improvement will be lost. I will implement it in a new PR so that we can compare.
```diff
 }

-pub unsafe fn pop_mark_stack() -> Option<usize> {
+pub unsafe fn pop_mark_stack() -> Option<(usize, Tag)> {
```
I guess you could return a sentinel, not an Option, for perf, but perhaps it's not worth the effort.
Maybe Rust even does that automatically? It does for some types, but the list doesn't mention tuples: https://doc.rust-lang.org/std/option/#representation
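A quick standalone check (nothing RTS-specific): `usize` has no niche, so
neither `Option<usize>` nor the new tuple payload gets the discriminant folded
away, while the `NonZero` types do.

```rust
use std::mem::size_of;
use std::num::NonZeroUsize;

fn main() {
    // No niche in `usize`: the discriminant needs extra space.
    println!("Option<usize>:        {} bytes", size_of::<Option<usize>>());
    println!("Option<(usize, u32)>: {} bytes", size_of::<Option<(usize, u32)>>());
    // Guaranteed by the docs linked above: same size as the payload.
    assert_eq!(size_of::<Option<NonZeroUsize>>(), size_of::<usize>());
}
```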
The last commit removes a redundant pass over the roots, so it should improve perf even more.

The benchmark reports -2.7% in instructions.
After all the optimizations, the current status is -17.15% in instructions compared to master.
This PR implements the mark-compact GC algorithm briefly described in "High-Performance Garbage Collection for Memory-Constrained Environments", section 5.1.2 (I think the original paper is "An Efficient Garbage Compaction Algorithm", but that paper is a bit difficult to follow).

The idea is as follows: when marking an object, we thread its backward pointers. Afterwards we linearly scan the heap as before. For a live object, we unthread it and update the backward pointers to it to the object's new location, move the object, and then thread its forward pointers. After this pass all objects are moved and all references are updated. This reduces the number of passes in the original mark-compact collector to 2.

The CanCan backend benchmark reports -19.8% in instructions. Compared to the copying GC, the mark-compact GC is now 36% slower, instead of 70% as before.

This PR also optimizes `mark_static_roots` by inlining `mark_fields` in the body and specializing it for `MutBox`: the static root array only points to static `MutBox`es, so there is no need to call the more general `mark_fields`. This optimization gives us approximately -2% in instructions.

See also #2641 for another, slower variant of the algorithm.
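The `mark_static_roots` specialization can be pictured roughly as follows. The
`MutBox` layout, the root-array representation, and the `mark_and_thread_field`
callback are assumptions for illustration, not the actual RTS definitions.

```rust
/// Assumed layout of a static MutBox: a header word plus one pointer field.
#[repr(C)]
struct MutBox {
    header: usize,
    field: usize,
}

/// Walk the static root array directly. Every element is known to be a static
/// `MutBox`, so instead of dispatching through the generic `mark_fields` we
/// mark/thread the single field each MutBox has.
unsafe fn mark_static_roots_specialized(
    roots: *const *mut MutBox,
    n_roots: usize,
    mut mark_and_thread_field: impl FnMut(*mut usize),
) {
    for i in 0..n_roots {
        let mutbox: *mut MutBox = *roots.add(i);
        // The MutBox itself is static and never moves; only the object its
        // field points to needs marking, and the field needs threading so it
        // is updated when that object moves.
        mark_and_thread_field(core::ptr::addr_of_mut!((*mutbox).field));
    }
}
```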