Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Attempt to GC mark already marked object #341

Closed
pavel-workato opened this issue Jun 28, 2023 · 22 comments · Fixed by #342 or #347
Closed

Attempt to GC mark already marked object #341

pavel-workato opened this issue Jun 28, 2023 · 22 comments · Fixed by #342 or #347

Comments

@pavel-workato
Copy link

Hello!
When calling MsgPack 1.7.0 we occasionally see following crashes in production. ~ 1 crash on every 1 million calls.

The underlying ruby code is

msgpack = ::MessagePack::Factory.new
msgpack.register_type( ... ) # some custom types for times, symbol, HWIA etc, quite similar to what paquito gem does
msgpack.dump({ ... }) # a pretty big ruby obj, with about 300k keys ~ about 20Mb binary output.

Before die VM reports this "[BUG] try to mark T_NONE object".

/app/vendor/bundle/ruby/2.7.0/gems/msgpack-1.7.0/lib/msgpack/factory.rb:77: [BUG] try to mark T_NONE object
ruby 2.7.7p221 (2022-11-24 revision 168ec2b1e5) [x86_64-linux]
-- Control frame information -----------------------------------------------
c:0217 p:---- s:1588 e:001587 CFUNC  :full_pack
c:0216 p:0020 s:1584 e:001583 METHOD /app/vendor/bundle/ruby/2.7.0/gems/msgpack-1.7.0/lib/msgpack/factory.rb:77
...
-- C level backtrace information -------------------------------------------
/app/vendor/ruby-2.7.7/bin/../lib/libruby.so.2.7(rb_print_backtrace+0x15) [0x7fe112f5f615] vm_dump.c:755
/app/vendor/ruby-2.7.7/bin/../lib/libruby.so.2.7(rb_vm_bugreport+0x71b) [0x7fe112f5fd4b] vm_dump.c:941
/app/vendor/ruby-2.7.7/bin/../lib/libruby.so.2.7(rb_bug+0xe4) [0x7fe112d6b778] error.c:645
/app/vendor/ruby-2.7.7/bin/../lib/libruby.so.2.7(gc_mark_maybe+0x0) [0x7fe112dfba90] gc.c:5188
/app/vendor/ruby-2.7.7/bin/../lib/libruby.so.2.7(gc_mark_ptr) (null):0
/app/vendor/bundle/ruby/2.7.0/gems/msgpack-1.7.0/lib/msgpack/msgpack.so(msgpack_buffer_mark+0x21) [0x7fe10b640fc1] buffer.c:119
/app/vendor/bundle/ruby/2.7.0/gems/msgpack-1.7.0/lib/msgpack/msgpack.so(Packer_mark+0x9) [0x7fe10b6482a9] packer_class.c:50
/app/vendor/ruby-2.7.7/bin/../lib/libruby.so.2.7(gc_marks_rest+0x135) [0x7fe112e003b5] gc.c:5519
/app/vendor/ruby-2.7.7/bin/../lib/libruby.so.2.7(gc_start+0x61d) [0x7fe112e00abd] gc.c:6543
/app/vendor/ruby-2.7.7/bin/../lib/libruby.so.2.7(garbage_collect_with_gvl+0x7a) [0x7fe112e0161a] gc.c:7507
/app/vendor/ruby-2.7.7/bin/../lib/libruby.so.2.7(objspace_malloc_fixup+0x17) [0x7fe112e01ac2] gc.c:9834
/app/vendor/ruby-2.7.7/bin/../lib/libruby.so.2.7(objspace_xmalloc0) gc.c:9876
/app/vendor/ruby-2.7.7/bin/../lib/libruby.so.2.7(str_new0+0x11c) [0x7fe112ef300c] string.c:753
/app/vendor/bundle/ruby/2.7.0/gems/msgpack-1.7.0/lib/msgpack/msgpack.so(msgpack_buffer_all_as_string+0x60) [0x7fe10b6417e0] buffer.c:511
/app/vendor/bundle/ruby/2.7.0/gems/msgpack-1.7.0/lib/msgpack/msgpack.so(Packer_full_pack+0x2c) [0x7fe10b64843c] packer_class.c:408
....

Obvious thing is that buffer chunk somehow stores already marked mapped_string, but it is not clear why. Hope to find some help, insights or suggestion from you.

  • Should I find the cause in MsgPack or other places?
  • Ideas how to reproduce the bug, even artificially, to see the same error.
  • Maybe no need reproduce, but only add a guard to msgpack_buffer_mark.

Thanks!

@byroot
Copy link
Member

byroot commented Jun 28, 2023

  • Should I find the cause in MsgPack or other places?

The C backtrace points at msgpack_buffer_mark, so yes, likely a missing GC_GUARD or similar in msgpack.

Ideas how to reproduce the bug, even artificially, to see the same error.

Have you tried GC.stress = true?

@byroot
Copy link
Member

byroot commented Jun 28, 2023

A reproduction would certainly help, but perhaps we'll be able to figure out the bug by auditing the code.

@pavel-workato
Copy link
Author

Have you tried GC.stress = true?

Yes, tried and keep experimenting adding more variety into the test case. But so far experimenting on my local machines haven't succeed.

@byroot
Copy link
Member

byroot commented Jun 28, 2023

No worries, thanks for trying. I'll try to have a look at the mark function tomorrow.

@peterzhu2118
Copy link
Member

It appears that you are on Ruby 2.7, which is EOL, so it may have bugs that have not been fixed. Can you try upgrading Ruby to see if this issue still occurs?

@pavel-workato
Copy link
Author

Oh, no, currently it is not possible. For sure, if there was a way to reproduce the issue, I'd have tried a newer ruby, but crashes happen in production env only.

@casperisfine
Copy link

Ok, so just braindumping what the backtrace tells us.

GC triggers when the "output" string is instantiated:

VALUE msgpack_buffer_all_as_string(msgpack_buffer_t* b)
{
    if(b->head == &b->tail) {
        return _msgpack_buffer_head_chunk_as_string(b);
    }

    size_t length = msgpack_buffer_all_readable_size(b);
    VALUE string = rb_str_new(NULL, length); // <--- HERE
    char* buffer = RSTRING_PTR(string);

Then a Packer is marked (probably the one we're using), which in turn delegate to marking its buffer, and it's the buffer_mark that crash when trying to mark a msgpack_buffer_chunk_t->mapped_string:

void msgpack_buffer_mark(void *ptr)
{
    msgpack_buffer_t* b = ptr;
    /* head is always available */
    msgpack_buffer_chunk_t* c = b->head;
    while(c != &b->tail) {
        rb_gc_mark(c->mapped_string); // <---- HERE
        c = c->next;
    }
    rb_gc_mark(c->mapped_string);

    rb_gc_mark(b->io);
    rb_gc_mark(b->io_buffer);
}

So far I've tried to add an explicit rb_gc_start() right when we allocate that string and run the test suite, but didn't get a repro.

I'll try to audit the chunk allocation and mapped_string assignation, it's possible that sometime we leave garbage in there.

casperisfine pushed a commit to Shopify/msgpack-ruby that referenced this issue Jun 29, 2023
Fix: msgpack#341

These struct contain a VALUE reference so if we don't zero
it out, it could be pointing at a T_NONE or some other old
object slot.

Especially since we can re-use existing chunks.
@casperisfine
Copy link

Ok, so it's purely theoretical, but I think I found a way such crash could happen: #342

@pavel-workato could you try this branch?

gem "msgpack", github: "https://github.com/msgpack/msgpack-ruby/pull/342"

@pavel-workato
Copy link
Author

@casperisfine Thank you for so prompt response! I took a look at the fix, and indeed it makes sense to me.
Yeah, let me try this PR on production, but given our crashes happen about once a week.. I'll get back with feedback in a week or two.

@casperisfine
Copy link

I'll get back with feedback in a week or two.

No worries, that wouldn't be the first time we have to wait a while before getting a confirmation :)

@byroot
Copy link
Member

byroot commented Jun 29, 2023

Reopening as the fix is only tentative for now.

@byroot byroot reopened this Jun 29, 2023
@sitano
Copy link

sitano commented Jul 4, 2023

@casperisfine I am trying to understand how #342 could affect what @pavel-workato saw in his stack trace and I am struggling to find out the flow, by looking at code, where non-zeroed mapped-string from the chunk appears right before the call to VALUE string = rb_str_new(NULL, length); at buffer.c:511 such that the dirty heap reference could interfere with the GC.

As I understand it, the idea is that dirty mapped-string VALUE must appear in the list tail before writing it with the correct value and the call to GC, so we must see something like that in the code

_msgpack_buffer_add_new_chunk(b); # adds dirty chunk on the heap
...
    VALUE string = rb_str_new(NULL, length); # GC in the middle over bad object on the heap
...
b->tail.mapped_string = GOOD_VALUE;

but by looking at all call sites of _msgpack_buffer_add_new_chunk(b) they all do b->tail.mapped_string = GOOD_VALUE; almost immediately after adding new chunk. Moreover, msgpack_buffer_all_as_string does not add new chunks.

Can you share if you see how the reused dirty chunk could appear in the code to satisfy the stack trace we see and the related fix?

@casperisfine
Copy link

So my theory on what is happening is:

First a chunk is used with a mapped_string and released:

msgpack_buffer_chunk_t *chunk = _msgpack_buffer_chunk_malloc(...)
chunk->mapped_string = rb_buf_new(...)

And then that chunk is released, hence added to the free list (via _msgpack_buffer_shift_chunk).

Then later on, we allocate a new chunk again, and that old one is re-used, still pointing the the old VALUE reference which since then was GCed, so is a T_NONE.

but by looking at all call sites of _msgpack_buffer_add_new_chunk(b) they all do b->tail.mapped_string = GOOD_VALUE;

That is true for _msgpack_buffer_append_reference:

    _msgpack_buffer_add_new_chunk(b);

    char* data = RSTRING_PTR(mapped_string);
    size_t length = RSTRING_LEN(mapped_string);

    b->tail.first = (char*) data;
    b->tail.last = (char*) data + length;
    b->tail.mapped_string = mapped_string;
    b->tail.mem = NULL;

No possible allocation here between _msgpack_buffer_add_new_chunk and b->tail.mapped_string = mapped_string;

The other call site (_msgpack_buffer_expand), however does allocate before setting mapped_string, so it could trigger GC there:

        /* allocate new chunk */
        _msgpack_buffer_add_new_chunk(b);

        char* mem = _msgpack_buffer_chunk_malloc(b, &b->tail, length, &capacity); // GC may trigger here

So if you wish to reproduct this, you need quite a few stars to align:

  • First you need a free chunk in the free list, that this pointing to a T_NONE slot.
  • Then you need to append to a packer with a large enough chunk so that's it's bigger than MSGPACK_RMEM_PAGE_SIZE (4kiB).
  • You need the GC to trigger on that last chunk allocation.

I'd love a repro, but it seems close to impossible to get all this happening in a controlled way. But perhaps by creating the required conditions and running it in a loop with some GC.start and GC.stress sprinkled in the right place, you might be able to make it happen after a while.

@sitano
Copy link

sitano commented Jul 4, 2023

@casperisfine +1 - I also saw those 2 cases. What I am thinking of is that those 2 cases do not relate to the problem with Pavel's backtrace in buffer.c:511. Also, we don't see _msgpack_buffer_expand on the stack as well as _msgpack_buffer_chunk_malloc. Thus, in my understanding, we still need to find how the bad chunk could appear on the list in the middle of rb_str_new() call.

@casperisfine
Copy link

Also, we don't see _msgpack_buffer_expand on the stack.

Hum, true, but it could be from another thread?

Are you @pavel-workato's coworker? If so, Does this code run in a multi-threaded environment (e.g. sidekiq or puma)?

Since you use register_type, when the packer calls a custom type proc, the GVL is released.

@pavel-workato
Copy link
Author

Hi! Indeed, there are some other auxiliary threads run in parallel, though the msgpack factory instance is not shared between any threads.

@casperisfine
Copy link

the msgpack factory instance is not shared between any threads.

My question wasn't wether the instances are shared between threads, but whether multiple threads are using msgpack (even independently).

Because if so the packer being marked could not be the one that is on the stack, but another one in use by another thread.

@casperisfine
Copy link

Either way, even if we don't have a specific repro or culprit (which I'd love to have), we can deduce for sure mapped_string point to garbage, and zeroing out the memory after allocation should fix this entire class of errors.

@sitano
Copy link

sitano commented Jul 11, 2023

@casperisfine @byroot hi! we have tested your patch #342 and it did not help. it still crashes.

luckily, we have a local system in which the crash reliably reproduces. its a relatively large amount of data. however, running the msgpack serialization test purely on that data outside of our system env does not reproduce the bug. it however should not be related to our system as it reliably works under very heavy load without message pack serialization.

in our local tests we have added sync GC call right before rb_str_new that started to crash the VM every time:

VALUE msgpack_buffer_all_as_string(msgpack_buffer_t* b)
{
    if(b->head == &b->tail) {
        return _msgpack_buffer_head_chunk_as_string(b);
    }

    size_t length = msgpack_buffer_all_readable_size(b);

    rb_eval_string("GC.start full_mark: true, immediate_mark: true, immediate_sweep: true"); // <--- crashes in our beloved (in mark) place
    VALUE string = rb_str_new(NULL, length);

we had success to extract the state of the buffer chunks list right before running into a T_NONE object:

>0: 0x60000973c210 - 0 - 0
>1: 0x60000948c5d0 - 0 - 0
>2: 0x60000948c630 - 7fc0cd0bf520 - 1
>3: 0x60000948c690 - 0 - 0
>4: 0x600009488b70 - 0 - 0
>5: 0x600009488ba0 - 0 - 0
>6: 0x6000097cbe10 - 0 - 0
>7: 0x6000097cb450 - 0 - 0
>8: 0x6000094880f0 - 0 - 0
>9: 0x60000948b510 - 0 - 0
>10: 0x60000948b4e0 - 7fc18b2ff690 - 0
>11: 0x60000948b4b0 - 0 - 0
>12: 0x6000097cb5d0 - 0 - 0
>13: 0x6000097cb660 - 0 - 0
>14: 0x60000948c3c0 - 0 - 0
>15: 0x600009761680 - 0 - 0
>16: 0x60000975cba0 - 7fc0eb2a5f40 - 0
>17: 0x60000975cb70 - 0 - 0
>18: 0x60000948c0c0 - 0 - 0
>19: 0x60000948c120 - 7fc14c4013f8 - 1
>20: 0x60000970dc50 - 0 - 0
>21: 0x6000097cb510 - 0 - 0
>22: 0x6000097cb4e0 - 7fc0eb296110 - 0
>23: 0x6000097cb5a0 - 0 - 0
>24: 0x6000025e3110 - 0 - 0

where the format is: chunk number: chunk pointer - mapped string VALUE - is it T_NONE?

as you can see, we have a very strange distribution of NO_MAPPED_STRINGs, alive mapped_strings and T_NONE strings. maybe you will have an idea about how it could happen.

the last test we have tried today was to disable the GC completely but this explicit sync call that we added and measure the T_NONE objects before and after. it turned out that there were none of the T_NONE objects appeared on the list right after the full sync GC calls. so we can conclude it somehow happens in other way.

any ideas?

@byroot
Copy link
Member

byroot commented Jul 11, 2023

Thank you for the extra data, I'll will have another look first thing tomorrow.

You pinpointing exactly where the GC triggers will certainly help.

@casperisfine
Copy link

Hum, I'm starting to think the assertion in #342 was correct, but that the fix was wrong:

Looking at _msgpack_buffer_add_new_chunk:

        msgpack_buffer_chunk_t* nc = _msgpack_buffer_alloc_new_chunk(b);

        if(b->rmem_last == b->tail_buffer_end) {
            /* reuse unused rmem space */
            size_t unused = b->tail_buffer_end - b->tail.last;
            b->rmem_last -= unused;
        }

        /* rebuild tail */
        *nc = b->tail;
        before_tail->next = nc;
        nc->next = &b->tail;

_msgpack_buffer_alloc_new_chunk now returns us a properly zero-ed out chunk, but that's useless because we immediately copy the previous tail chunk into it. So I think we just zero-ed out the wrong chunk.

casperisfine pushed a commit to Shopify/msgpack-ruby that referenced this issue Jul 12, 2023
What is happening is that when we call into a recursive packing
proc, we first save the packer buffer state onto the stack
and then reset the buffer.

Once we return from the proc, the original buffer state is copied
back.

The problem with this is that if any of the chunk has a mapped string
then they are not reachable by any Ruby object and may be garbage collected
at any moment.
casperisfine pushed a commit to Shopify/msgpack-ruby that referenced this issue Jul 12, 2023
What is happening is that when we call into a recursive packing
proc, we first save the packer buffer state onto the stack
and then reset the buffer.

Once we return from the proc, the original buffer state is copied
back.

The problem with this is that if any of the chunk has a mapped string
then they are not reachable by any Ruby object and may be garbage collected
at any moment.
casperisfine pushed a commit to Shopify/msgpack-ruby that referenced this issue Jul 12, 2023
What is happening is that when we call into a recursive packing
proc, we first save the packer buffer state onto the stack
and then reset the buffer.

Once we return from the proc, the original buffer state is copied
back.

The problem with this is that if any of the chunk has a mapped string
then they are not reachable by any Ruby object and may be garbage collected
at any moment.
casperisfine pushed a commit to Shopify/msgpack-ruby that referenced this issue Jul 13, 2023
What is happening is that when we call into a recursive packing
proc, we first save the packer buffer state onto the stack
and then reset the buffer.

Once we return from the proc, the original buffer state is copied
back.

The problem with this is that if any of the chunk has a mapped string
then they are not reachable by any Ruby object and may be garbage collected
at any moment.
casperisfine pushed a commit to Shopify/msgpack-ruby that referenced this issue Jul 13, 2023
What is happening is that when we call into a recursive packing
proc, we first save the packer buffer state onto the stack
and then reset the buffer.

Once we return from the proc, the original buffer state is copied
back.

The problem with this is that if any of the chunk has a mapped string
then they are not reachable by any Ruby object and may be garbage collected
at any moment.
@casperisfine
Copy link

I just released 1.7.2 with the fix for this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants