Attempt to GC mark already marked object #341

pavel-workato · 2023-06-28T16:00:35Z

Hello!
When calling MsgPack 1.7.0 we occasionally see following crashes in production. ~ 1 crash on every 1 million calls.

The underlying ruby code is

msgpack = ::MessagePack::Factory.new
msgpack.register_type( ... ) # some custom types for times, symbol, HWIA etc, quite similar to what paquito gem does
msgpack.dump({ ... }) # a pretty big ruby obj, with about 300k keys ~ about 20Mb binary output.

Before die VM reports this "[BUG] try to mark T_NONE object".

/app/vendor/bundle/ruby/2.7.0/gems/msgpack-1.7.0/lib/msgpack/factory.rb:77: [BUG] try to mark T_NONE object
ruby 2.7.7p221 (2022-11-24 revision 168ec2b1e5) [x86_64-linux]
-- Control frame information -----------------------------------------------
c:0217 p:---- s:1588 e:001587 CFUNC  :full_pack
c:0216 p:0020 s:1584 e:001583 METHOD /app/vendor/bundle/ruby/2.7.0/gems/msgpack-1.7.0/lib/msgpack/factory.rb:77
...
-- C level backtrace information -------------------------------------------
/app/vendor/ruby-2.7.7/bin/../lib/libruby.so.2.7(rb_print_backtrace+0x15) [0x7fe112f5f615] vm_dump.c:755
/app/vendor/ruby-2.7.7/bin/../lib/libruby.so.2.7(rb_vm_bugreport+0x71b) [0x7fe112f5fd4b] vm_dump.c:941
/app/vendor/ruby-2.7.7/bin/../lib/libruby.so.2.7(rb_bug+0xe4) [0x7fe112d6b778] error.c:645
/app/vendor/ruby-2.7.7/bin/../lib/libruby.so.2.7(gc_mark_maybe+0x0) [0x7fe112dfba90] gc.c:5188
/app/vendor/ruby-2.7.7/bin/../lib/libruby.so.2.7(gc_mark_ptr) (null):0
/app/vendor/bundle/ruby/2.7.0/gems/msgpack-1.7.0/lib/msgpack/msgpack.so(msgpack_buffer_mark+0x21) [0x7fe10b640fc1] buffer.c:119
/app/vendor/bundle/ruby/2.7.0/gems/msgpack-1.7.0/lib/msgpack/msgpack.so(Packer_mark+0x9) [0x7fe10b6482a9] packer_class.c:50
/app/vendor/ruby-2.7.7/bin/../lib/libruby.so.2.7(gc_marks_rest+0x135) [0x7fe112e003b5] gc.c:5519
/app/vendor/ruby-2.7.7/bin/../lib/libruby.so.2.7(gc_start+0x61d) [0x7fe112e00abd] gc.c:6543
/app/vendor/ruby-2.7.7/bin/../lib/libruby.so.2.7(garbage_collect_with_gvl+0x7a) [0x7fe112e0161a] gc.c:7507
/app/vendor/ruby-2.7.7/bin/../lib/libruby.so.2.7(objspace_malloc_fixup+0x17) [0x7fe112e01ac2] gc.c:9834
/app/vendor/ruby-2.7.7/bin/../lib/libruby.so.2.7(objspace_xmalloc0) gc.c:9876
/app/vendor/ruby-2.7.7/bin/../lib/libruby.so.2.7(str_new0+0x11c) [0x7fe112ef300c] string.c:753
/app/vendor/bundle/ruby/2.7.0/gems/msgpack-1.7.0/lib/msgpack/msgpack.so(msgpack_buffer_all_as_string+0x60) [0x7fe10b6417e0] buffer.c:511
/app/vendor/bundle/ruby/2.7.0/gems/msgpack-1.7.0/lib/msgpack/msgpack.so(Packer_full_pack+0x2c) [0x7fe10b64843c] packer_class.c:408
....

Obvious thing is that buffer chunk somehow stores already marked mapped_string, but it is not clear why. Hope to find some help, insights or suggestion from you.

Should I find the cause in MsgPack or other places?
Ideas how to reproduce the bug, even artificially, to see the same error.
Maybe no need reproduce, but only add a guard to msgpack_buffer_mark.

Thanks!

The text was updated successfully, but these errors were encountered:

byroot · 2023-06-28T17:04:48Z

Should I find the cause in MsgPack or other places?

The C backtrace points at msgpack_buffer_mark, so yes, likely a missing GC_GUARD or similar in msgpack.

Ideas how to reproduce the bug, even artificially, to see the same error.

Have you tried GC.stress = true?

byroot · 2023-06-28T17:05:09Z

A reproduction would certainly help, but perhaps we'll be able to figure out the bug by auditing the code.

pavel-workato · 2023-06-28T17:15:01Z

Have you tried GC.stress = true?

Yes, tried and keep experimenting adding more variety into the test case. But so far experimenting on my local machines haven't succeed.

byroot · 2023-06-28T19:08:28Z

No worries, thanks for trying. I'll try to have a look at the mark function tomorrow.

peterzhu2118 · 2023-06-28T20:10:32Z

It appears that you are on Ruby 2.7, which is EOL, so it may have bugs that have not been fixed. Can you try upgrading Ruby to see if this issue still occurs?

pavel-workato · 2023-06-28T20:16:58Z

Oh, no, currently it is not possible. For sure, if there was a way to reproduce the issue, I'd have tried a newer ruby, but crashes happen in production env only.

casperisfine · 2023-06-29T07:24:47Z

Ok, so just braindumping what the backtrace tells us.

GC triggers when the "output" string is instantiated:

VALUE msgpack_buffer_all_as_string(msgpack_buffer_t* b)
{
    if(b->head == &b->tail) {
        return _msgpack_buffer_head_chunk_as_string(b);
    }

    size_t length = msgpack_buffer_all_readable_size(b);
    VALUE string = rb_str_new(NULL, length); // <--- HERE
    char* buffer = RSTRING_PTR(string);

Then a Packer is marked (probably the one we're using), which in turn delegate to marking its buffer, and it's the buffer_mark that crash when trying to mark a msgpack_buffer_chunk_t->mapped_string:

void msgpack_buffer_mark(void *ptr)
{
    msgpack_buffer_t* b = ptr;
    /* head is always available */
    msgpack_buffer_chunk_t* c = b->head;
    while(c != &b->tail) {
        rb_gc_mark(c->mapped_string); // <---- HERE
        c = c->next;
    }
    rb_gc_mark(c->mapped_string);

    rb_gc_mark(b->io);
    rb_gc_mark(b->io_buffer);
}

So far I've tried to add an explicit rb_gc_start() right when we allocate that string and run the test suite, but didn't get a repro.

I'll try to audit the chunk allocation and mapped_string assignation, it's possible that sometime we leave garbage in there.

Fix: msgpack#341 These struct contain a VALUE reference so if we don't zero it out, it could be pointing at a T_NONE or some other old object slot. Especially since we can re-use existing chunks.

casperisfine · 2023-06-29T07:57:07Z

Ok, so it's purely theoretical, but I think I found a way such crash could happen: #342

@pavel-workato could you try this branch?

gem "msgpack", github: "https://github.com/msgpack/msgpack-ruby/pull/342"

pavel-workato · 2023-06-29T09:00:55Z

@casperisfine Thank you for so prompt response! I took a look at the fix, and indeed it makes sense to me.
Yeah, let me try this PR on production, but given our crashes happen about once a week.. I'll get back with feedback in a week or two.

casperisfine · 2023-06-29T09:05:07Z

I'll get back with feedback in a week or two.

No worries, that wouldn't be the first time we have to wait a while before getting a confirmation :)

byroot · 2023-06-29T12:48:54Z

Reopening as the fix is only tentative for now.

sitano · 2023-07-04T09:12:20Z

@casperisfine I am trying to understand how #342 could affect what @pavel-workato saw in his stack trace and I am struggling to find out the flow, by looking at code, where non-zeroed mapped-string from the chunk appears right before the call to VALUE string = rb_str_new(NULL, length); at buffer.c:511 such that the dirty heap reference could interfere with the GC.

As I understand it, the idea is that dirty mapped-string VALUE must appear in the list tail before writing it with the correct value and the call to GC, so we must see something like that in the code

_msgpack_buffer_add_new_chunk(b); # adds dirty chunk on the heap
...
    VALUE string = rb_str_new(NULL, length); # GC in the middle over bad object on the heap
...
b->tail.mapped_string = GOOD_VALUE;

but by looking at all call sites of _msgpack_buffer_add_new_chunk(b) they all do b->tail.mapped_string = GOOD_VALUE; almost immediately after adding new chunk. Moreover, msgpack_buffer_all_as_string does not add new chunks.

Can you share if you see how the reused dirty chunk could appear in the code to satisfy the stack trace we see and the related fix?

casperisfine · 2023-07-04T10:22:52Z

So my theory on what is happening is:

First a chunk is used with a mapped_string and released:

msgpack_buffer_chunk_t *chunk = _msgpack_buffer_chunk_malloc(...)
chunk->mapped_string = rb_buf_new(...)

And then that chunk is released, hence added to the free list (via _msgpack_buffer_shift_chunk).

Then later on, we allocate a new chunk again, and that old one is re-used, still pointing the the old VALUE reference which since then was GCed, so is a T_NONE.

but by looking at all call sites of _msgpack_buffer_add_new_chunk(b) they all do b->tail.mapped_string = GOOD_VALUE;

That is true for _msgpack_buffer_append_reference:

    _msgpack_buffer_add_new_chunk(b);

    char* data = RSTRING_PTR(mapped_string);
    size_t length = RSTRING_LEN(mapped_string);

    b->tail.first = (char*) data;
    b->tail.last = (char*) data + length;
    b->tail.mapped_string = mapped_string;
    b->tail.mem = NULL;

No possible allocation here between _msgpack_buffer_add_new_chunk and b->tail.mapped_string = mapped_string;

The other call site (_msgpack_buffer_expand), however does allocate before setting mapped_string, so it could trigger GC there:

        /* allocate new chunk */
        _msgpack_buffer_add_new_chunk(b);

        char* mem = _msgpack_buffer_chunk_malloc(b, &b->tail, length, &capacity); // GC may trigger here

So if you wish to reproduct this, you need quite a few stars to align:

First you need a free chunk in the free list, that this pointing to a T_NONE slot.
Then you need to append to a packer with a large enough chunk so that's it's bigger than MSGPACK_RMEM_PAGE_SIZE (4kiB).
You need the GC to trigger on that last chunk allocation.

I'd love a repro, but it seems close to impossible to get all this happening in a controlled way. But perhaps by creating the required conditions and running it in a loop with some GC.start and GC.stress sprinkled in the right place, you might be able to make it happen after a while.

sitano · 2023-07-04T11:02:45Z

@casperisfine +1 - I also saw those 2 cases. What I am thinking of is that those 2 cases do not relate to the problem with Pavel's backtrace in buffer.c:511. Also, we don't see _msgpack_buffer_expand on the stack as well as _msgpack_buffer_chunk_malloc. Thus, in my understanding, we still need to find how the bad chunk could appear on the list in the middle of rb_str_new() call.

casperisfine · 2023-07-04T11:07:41Z

Also, we don't see _msgpack_buffer_expand on the stack.

Hum, true, but it could be from another thread?

Are you @pavel-workato's coworker? If so, Does this code run in a multi-threaded environment (e.g. sidekiq or puma)?

Since you use register_type, when the packer calls a custom type proc, the GVL is released.

pavel-workato · 2023-07-04T11:22:36Z

Hi! Indeed, there are some other auxiliary threads run in parallel, though the msgpack factory instance is not shared between any threads.

casperisfine · 2023-07-04T11:25:25Z

the msgpack factory instance is not shared between any threads.

My question wasn't wether the instances are shared between threads, but whether multiple threads are using msgpack (even independently).

Because if so the packer being marked could not be the one that is on the stack, but another one in use by another thread.

casperisfine · 2023-07-04T11:27:16Z

Either way, even if we don't have a specific repro or culprit (which I'd love to have), we can deduce for sure mapped_string point to garbage, and zeroing out the memory after allocation should fix this entire class of errors.

sitano · 2023-07-11T17:21:54Z

@casperisfine @byroot hi! we have tested your patch #342 and it did not help. it still crashes.

luckily, we have a local system in which the crash reliably reproduces. its a relatively large amount of data. however, running the msgpack serialization test purely on that data outside of our system env does not reproduce the bug. it however should not be related to our system as it reliably works under very heavy load without message pack serialization.

in our local tests we have added sync GC call right before rb_str_new that started to crash the VM every time:

VALUE msgpack_buffer_all_as_string(msgpack_buffer_t* b)
{
    if(b->head == &b->tail) {
        return _msgpack_buffer_head_chunk_as_string(b);
    }

    size_t length = msgpack_buffer_all_readable_size(b);

    rb_eval_string("GC.start full_mark: true, immediate_mark: true, immediate_sweep: true"); // <--- crashes in our beloved (in mark) place
    VALUE string = rb_str_new(NULL, length);

we had success to extract the state of the buffer chunks list right before running into a T_NONE object:

>0: 0x60000973c210 - 0 - 0
>1: 0x60000948c5d0 - 0 - 0
>2: 0x60000948c630 - 7fc0cd0bf520 - 1
>3: 0x60000948c690 - 0 - 0
>4: 0x600009488b70 - 0 - 0
>5: 0x600009488ba0 - 0 - 0
>6: 0x6000097cbe10 - 0 - 0
>7: 0x6000097cb450 - 0 - 0
>8: 0x6000094880f0 - 0 - 0
>9: 0x60000948b510 - 0 - 0
>10: 0x60000948b4e0 - 7fc18b2ff690 - 0
>11: 0x60000948b4b0 - 0 - 0
>12: 0x6000097cb5d0 - 0 - 0
>13: 0x6000097cb660 - 0 - 0
>14: 0x60000948c3c0 - 0 - 0
>15: 0x600009761680 - 0 - 0
>16: 0x60000975cba0 - 7fc0eb2a5f40 - 0
>17: 0x60000975cb70 - 0 - 0
>18: 0x60000948c0c0 - 0 - 0
>19: 0x60000948c120 - 7fc14c4013f8 - 1
>20: 0x60000970dc50 - 0 - 0
>21: 0x6000097cb510 - 0 - 0
>22: 0x6000097cb4e0 - 7fc0eb296110 - 0
>23: 0x6000097cb5a0 - 0 - 0
>24: 0x6000025e3110 - 0 - 0

where the format is: chunk number: chunk pointer - mapped string VALUE - is it T_NONE?

as you can see, we have a very strange distribution of NO_MAPPED_STRINGs, alive mapped_strings and T_NONE strings. maybe you will have an idea about how it could happen.

the last test we have tried today was to disable the GC completely but this explicit sync call that we added and measure the T_NONE objects before and after. it turned out that there were none of the T_NONE objects appeared on the list right after the full sync GC calls. so we can conclude it somehow happens in other way.

any ideas?

byroot · 2023-07-11T17:24:36Z

Thank you for the extra data, I'll will have another look first thing tomorrow.

You pinpointing exactly where the GC triggers will certainly help.

casperisfine · 2023-07-12T09:58:56Z

Hum, I'm starting to think the assertion in #342 was correct, but that the fix was wrong:

Looking at _msgpack_buffer_add_new_chunk:

        msgpack_buffer_chunk_t* nc = _msgpack_buffer_alloc_new_chunk(b);

        if(b->rmem_last == b->tail_buffer_end) {
            /* reuse unused rmem space */
            size_t unused = b->tail_buffer_end - b->tail.last;
            b->rmem_last -= unused;
        }

        /* rebuild tail */
        *nc = b->tail;
        before_tail->next = nc;
        nc->next = &b->tail;

_msgpack_buffer_alloc_new_chunk now returns us a properly zero-ed out chunk, but that's useless because we immediately copy the previous tail chunk into it. So I think we just zero-ed out the wrong chunk.

What is happening is that when we call into a recursive packing proc, we first save the packer buffer state onto the stack and then reset the buffer. Once we return from the proc, the original buffer state is copied back. The problem with this is that if any of the chunk has a mapped string then they are not reachable by any Ruby object and may be garbage collected at any moment.

casperisfine · 2023-07-18T07:49:21Z

I just released 1.7.2 with the fix for this.

casperisfine mentioned this issue Jun 29, 2023

Zero-out msgpack_buffer_chunk_t after allocation #342

Merged

byroot closed this as completed in #342 Jun 29, 2023

byroot reopened this Jun 29, 2023

casperisfine mentioned this issue Jul 12, 2023

_msgpack_buffer_add_new_chunk zero-out the newly allocated tail #343

Open

pavel-workato mentioned this issue Jul 12, 2023

Reproducer for issue 341 #344

Closed

casperisfine mentioned this issue Jul 12, 2023

Mark buffers held onto the stack #346

Closed

casperisfine mentioned this issue Jul 13, 2023

Packer: store the parent buffer mapped strings on the stack #347

Merged

byroot closed this as completed in #347 Jul 13, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Attempt to GC mark already marked object #341

Attempt to GC mark already marked object #341

pavel-workato commented Jun 28, 2023

byroot commented Jun 28, 2023

byroot commented Jun 28, 2023

pavel-workato commented Jun 28, 2023

byroot commented Jun 28, 2023

peterzhu2118 commented Jun 28, 2023

pavel-workato commented Jun 28, 2023

casperisfine commented Jun 29, 2023

casperisfine commented Jun 29, 2023

pavel-workato commented Jun 29, 2023

casperisfine commented Jun 29, 2023

byroot commented Jun 29, 2023

sitano commented Jul 4, 2023 •

edited

Loading

casperisfine commented Jul 4, 2023

sitano commented Jul 4, 2023 •

edited

Loading

casperisfine commented Jul 4, 2023

pavel-workato commented Jul 4, 2023

casperisfine commented Jul 4, 2023

casperisfine commented Jul 4, 2023

sitano commented Jul 11, 2023

byroot commented Jul 11, 2023

casperisfine commented Jul 12, 2023

casperisfine commented Jul 18, 2023

Attempt to GC mark already marked object #341

Attempt to GC mark already marked object #341

Comments

pavel-workato commented Jun 28, 2023

byroot commented Jun 28, 2023

byroot commented Jun 28, 2023

pavel-workato commented Jun 28, 2023

byroot commented Jun 28, 2023

peterzhu2118 commented Jun 28, 2023

pavel-workato commented Jun 28, 2023

casperisfine commented Jun 29, 2023

casperisfine commented Jun 29, 2023

pavel-workato commented Jun 29, 2023

casperisfine commented Jun 29, 2023

byroot commented Jun 29, 2023

sitano commented Jul 4, 2023 • edited Loading

casperisfine commented Jul 4, 2023

sitano commented Jul 4, 2023 • edited Loading

casperisfine commented Jul 4, 2023

pavel-workato commented Jul 4, 2023

casperisfine commented Jul 4, 2023

casperisfine commented Jul 4, 2023

sitano commented Jul 11, 2023

byroot commented Jul 11, 2023

casperisfine commented Jul 12, 2023

casperisfine commented Jul 18, 2023

sitano commented Jul 4, 2023 •

edited

Loading

sitano commented Jul 4, 2023 •

edited

Loading