Attempt to GC mark already marked object #341
The C backtrace points at
Have you tried |
A reproduction would certainly help, but perhaps we'll be able to figure out the bug by auditing the code.
Yes, I have tried, and I keep experimenting by adding more variety to the test case. But so far the experiments on my local machines haven't succeeded.
No worries, thanks for trying. I'll try to have a look at the mark function tomorrow.
It appears that you are on Ruby 2.7, which is EOL, so it may have bugs that have not been fixed. Can you try upgrading Ruby to see if this issue still occurs?
Oh, no, currently that is not possible. For sure, if there was a way to reproduce the issue, I'd have tried a newer Ruby, but the crashes happen in the production env only.
Ok, so just braindumping what the backtrace tells us. GC triggers when the "output" string is instantiated:

    VALUE msgpack_buffer_all_as_string(msgpack_buffer_t* b)
    {
        if(b->head == &b->tail) {
            return _msgpack_buffer_head_chunk_as_string(b);
        }
        size_t length = msgpack_buffer_all_readable_size(b);
        VALUE string = rb_str_new(NULL, length); // <--- HERE
        char* buffer = RSTRING_PTR(string);

Then a

    void msgpack_buffer_mark(void *ptr)
    {
        msgpack_buffer_t* b = ptr;
        /* head is always available */
        msgpack_buffer_chunk_t* c = b->head;
        while(c != &b->tail) {
            rb_gc_mark(c->mapped_string); // <---- HERE
            c = c->next;
        }
        rb_gc_mark(c->mapped_string);
        rb_gc_mark(b->io);
        rb_gc_mark(b->io_buffer);
    }

So far I've tried to add an explicit I'll try to audit the
Fix: msgpack#341 These structs contain a VALUE reference, so if we don't zero it out, it could be pointing at a T_NONE or some other old object slot, especially since we can re-use existing chunks.
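To make the stale-reference idea concrete, here is a minimal, self-contained sketch of a chunk free list. It is only an illustration under assumptions: `value_t`, `chunk_t`, `pool_t`, `pool_acquire`, `pool_release` and `stale_reference_after_reuse` are hypothetical names, not the msgpack-ruby structs or functions, and `0xdeadbeef` merely stands in for a reference to a Ruby string.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for Ruby's VALUE (hypothetical); 0 means "no object" here. */
typedef unsigned long value_t;

typedef struct chunk {
    char *first;
    char *last;
    void *mem;
    value_t mapped_string;   /* reference a GC mark function would follow */
    struct chunk *next;
} chunk_t;

typedef struct {
    chunk_t *free_list;      /* released chunks kept around for reuse */
} pool_t;

/* Put a chunk back on the free list without clearing its fields. */
void pool_release(pool_t *pool, chunk_t *c)
{
    c->next = pool->free_list;
    pool->free_list = c;
}

/* Hand out a chunk. Fresh chunks are always zeroed (calloc); recycled
 * chunks are zeroed only when zero_on_reuse is non-zero. */
chunk_t *pool_acquire(pool_t *pool, int zero_on_reuse)
{
    chunk_t *c = pool->free_list;
    if (c == NULL) {
        return calloc(1, sizeof(*c));
    }
    pool->free_list = c->next;
    if (zero_on_reuse) {
        memset(c, 0, sizeof(*c));
    }
    return c;
}

/* Use a chunk, release it, acquire again, and report which
 * mapped_string the recycled chunk still carries. */
value_t stale_reference_after_reuse(int zero_on_reuse)
{
    pool_t pool = { NULL };
    chunk_t *c = pool_acquire(&pool, zero_on_reuse);
    if (c == NULL) {
        return 0;
    }
    c->mapped_string = 0xdeadbeefUL;   /* pretend this is a Ruby string */
    pool_release(&pool, c);

    chunk_t *again = pool_acquire(&pool, zero_on_reuse);
    value_t seen = again->mapped_string;
    free(again);
    return seen;
}
```

In this sketch, `stale_reference_after_reuse(0)` returns the old reference, which a mark function would then follow into a slot that may already have been freed, while `stale_reference_after_reuse(1)` returns 0.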
Ok, so it's purely theoretical, but I think I found a way such a crash could happen: #342. @pavel-workato could you try this branch? gem "msgpack", github: "https://github.com/msgpack/msgpack-ruby/pull/342"
@casperisfine Thank you for such a prompt response! I took a look at the fix, and indeed it makes sense to me.
No worries, that wouldn't be the first time we have to wait a while before getting a confirmation :)
Reopening as the fix is only tentative for now.
@casperisfine I am trying to understand how #342 could affect what @pavel-workato saw in his stack trace and I am struggling to find out the flow, by looking at code, where non-zeroed mapped-string from the chunk appears right before the call to As I understand it, the idea is that dirty mapped-string VALUE must appear in the list tail before writing it with the correct value and the call to GC, so we must see something like that in the code
but by looking at all call sites of Can you share if you see how the reused dirty chunk could appear in the code to satisfy the stack trace we see and the related fix? |
So my theory on what is happening is: First a chunk is used with a

    msgpack_buffer_chunk_t *chunk = _msgpack_buffer_chunk_malloc(...)
    chunk->mapped_string = rb_buf_new(...)

And then that chunk is released, hence added to the free list (via

Then later on, we allocate a new chunk again, and that old one is re-used, still pointing to the old

That is true for

    _msgpack_buffer_add_new_chunk(b);
    char* data = RSTRING_PTR(mapped_string);
    size_t length = RSTRING_LEN(mapped_string);
    b->tail.first = (char*) data;
    b->tail.last = (char*) data + length;
    b->tail.mapped_string = mapped_string;
    b->tail.mem = NULL;

No possible allocation here between

The other call site (

    /* allocate new chunk */
    _msgpack_buffer_add_new_chunk(b);
    char* mem = _msgpack_buffer_chunk_malloc(b, &b->tail, length, &capacity); // GC may trigger here

So if you wish to reproduce this, you need quite a few stars to align:

I'd love a repro, but it seems close to impossible to get all this happening in a controlled way. But perhaps by creating the required conditions and running it in a loop with some
@casperisfine +1 - I also saw those 2 cases. What I am thinking of is that those 2 cases do not relate to the problem with Pavel's backtrace in |
Hum, true, but it could be from another thread? Are you @pavel-workato's coworker? If so, does this code run in a multi-threaded environment (e.g. sidekiq or puma)? Since you use
Hi! Indeed, there are some other auxiliary threads running in parallel, though the
My question wasn't whether the instances are shared between threads, but whether multiple threads are using Because if so the
Either way, even if we don't have a specific repro or culprit (which I'd love to have), we can deduce for sure |
@casperisfine @byroot hi! we have tested your patch #342 and it did not help. it still crashes. luckily, we have a local system in which the crash reliably reproduces. it's a relatively large amount of data. however, running the msgpack serialization test purely on that data outside of our system env does not reproduce the bug. it should not be related to our system, however, as it reliably works under very heavy load without message pack serialization. in our local tests we have added a sync GC call right before

    VALUE msgpack_buffer_all_as_string(msgpack_buffer_t* b)
    {
        if(b->head == &b->tail) {
            return _msgpack_buffer_head_chunk_as_string(b);
        }
        size_t length = msgpack_buffer_all_readable_size(b);
        rb_eval_string("GC.start full_mark: true, immediate_mark: true, immediate_sweep: true"); // <--- crashes in our beloved (in mark) place
        VALUE string = rb_str_new(NULL, length);

we managed to extract the state of the buffer chunks list right before running into a T_NONE object:

where the format is:

as you can see, we have a very strange distribution of NO_MAPPED_STRINGs, alive mapped_strings and T_NONE strings. maybe you will have an idea about how it could happen. the last test we tried today was to disable the GC completely except for this explicit sync call that we added, and to measure the T_NONE objects before and after. it turned out that no T_NONE objects appeared on the list right after the full sync GC calls. so we can conclude it somehow happens in some other way. any ideas?
Thank you for the extra data, I'll have another look first thing tomorrow. You pinpointing exactly where the GC triggers will certainly help.
Hum, I'm starting to think the assertion in #342 was correct, but that the fix was wrong: Looking at

    msgpack_buffer_chunk_t* nc = _msgpack_buffer_alloc_new_chunk(b);
    if(b->rmem_last == b->tail_buffer_end) {
        /* reuse unused rmem space */
        size_t unused = b->tail_buffer_end - b->tail.last;
        b->rmem_last -= unused;
    }
    /* rebuild tail */
    *nc = b->tail;
    before_tail->next = nc;
    nc->next = &b->tail;
What is happening is that when we call into a recursive packing proc, we first save the packer buffer state onto the stack and then reset the buffer. Once we return from the proc, the original buffer state is copied back. The problem with this is that if any of the chunks has a mapped string, that string is not reachable by any Ruby object while the proc runs and may be garbage collected at any moment.
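The save/reset/restore sequence described above can be modelled with a toy mark-and-sweep pass. This is a sketch under loud assumptions: it is not Ruby's GC (it ignores conservative stack scanning, among other things), and `toy_string_t`, `toy_buffer_t`, `toy_mark`, `toy_sweep` and `outer_string_survives` are hypothetical names. The `keep_saved_state_marked` flag shows one conceivable way to keep the saved references alive; the thread does not say how the actual fix does it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_CHUNKS 4

/* Toy heap object: `marked` is set by the mark phase,
 * `live` is cleared by the sweep phase when the object is unmarked. */
typedef struct {
    bool marked;
    bool live;
} toy_string_t;

/* Toy buffer: the only path from a GC root to the mapped strings. */
typedef struct {
    toy_string_t *mapped[MAX_CHUNKS];
    size_t count;
} toy_buffer_t;

/* Mark phase: follow the references held by one buffer. */
void toy_mark(const toy_buffer_t *b)
{
    for (size_t i = 0; i < b->count; i++) {
        b->mapped[i]->marked = true;
    }
}

/* Sweep phase: anything left unmarked is considered freed. */
void toy_sweep(toy_string_t *heap, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        heap[i].live = heap[i].live && heap[i].marked;
        heap[i].marked = false;
    }
}

/* Simulate a nested pack call that triggers a GC while the outer buffer
 * state is set aside. Returns true if the outer mapped string survived. */
bool outer_string_survives(bool keep_saved_state_marked)
{
    toy_string_t heap[1] = { { false, true } };
    toy_buffer_t buffer = { { &heap[0] }, 1 };

    /* Save the state on the C stack and reset the registered buffer. */
    toy_buffer_t saved = buffer;
    memset(&buffer, 0, sizeof(buffer));

    /* GC runs during the nested call: only `buffer` is a known root. */
    toy_mark(&buffer);
    if (keep_saved_state_marked) {
        toy_mark(&saved);   /* assumption: the saved copy is marked too */
    }
    toy_sweep(heap, 1);

    /* Copy the original state back, as described above. */
    buffer = saved;
    return buffer.mapped[0]->live;
}
```

Here `outer_string_survives(false)` reports that the string was swept while it was only referenced from the saved copy, so the restored buffer points at a dead slot; `outer_string_survives(true)` keeps it alive.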
I just released |
Hello!
When calling MsgPack 1.7.0 we occasionally see the following crashes in production: roughly 1 crash per 1 million calls.
The underlying Ruby code is
Before dying, the VM reports "[BUG] try to mark T_NONE object". The C backtrace points at msgpack_buffer_mark.
The obvious thing is that a buffer chunk somehow stores an already marked mapped_string, but it is not clear why. Hope to find some help, insights or suggestions from you.
Thanks!