intermittent memory leak for req/rep send/recv. #2602

Closed
Matthew-Jemielity opened this issue Jun 21, 2017 · 23 comments

@Matthew-Jemielity

Matthew-Jemielity commented Jun 21, 2017

Hi,

While unit-testing my project, which uses zmq, I found a randomly occurring memory leak in a REQ/REP socket pair doing zmq_send{,_const} and zmq_recv. I'm attaching an almost-minimal self-contained program that reproduces the issue:
zmq_recv_req_rep.txt
Hopefully I'm using libzmq correctly.
There seems to be some kind of race condition involved, since checking with valgrind can have four different outcomes:

  • no error (often)
  • memleak only (very rare)
  • invalid read + write only (sometimes)
  • invalid read + write and memleak (rare)

The errors seem easier to reproduce on a 32-bit system.

The libzmq I'm using was compiled from source, commit 2d83acc, via ./autogen.sh && ./configure --prefix=/usr && make && make check && sudo make install.
Tested on systems:

  1. Ubuntu 14.04.05 64-bit: 4.2.0-42-lowlatency #49~14.04.1-Ubuntu SMP PREEMPT Wed Jun 29 21:21:58 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux, gcc (Ubuntu 4.8.4-2ubuntu1~14.04.3) 4.8.4, valgrind-3.10.1
  2. qemu vm with Ubuntu 14.04.05 32-bit: 4.4.0-31-generic #50~14.04.1-Ubuntu SMP Wed Jul 13 01:06:37 UTC 2016 i686 i686 i686 GNU/Linux, gcc (Ubuntu 4.8.4-2ubuntu1~14.04.3) 4.8.4, valgrind-3.10.1

Source was compiled with:
gcc -Wall -Wextra -std=c99 -pedantic zmq_recv_req_rep.c -lzmq -pthread -o test
and checked with:
valgrind --leak-check=full ./test

valgrind output:

==18183== Memcheck, a memory error detector
==18183== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
==18183== Using Valgrind-3.10.1 and LibVEX; rerun with -h for copyright info
==18183== Command: ./test
==18183== 
==18183== Thread 2:
==18183== Invalid write of size 1
==18183==    at 0x4E6D239: zmq::pipe_t::terminate(bool) (pipe.cpp:393)
==18183==    by 0x4E817C7: zmq::socket_base_t::term_endpoint(char const*) (socket_base.cpp:1086)
==18183==    by 0x400ED6: receiving_thread (in [...]/zmq_memleak/test)
==18183==    by 0x52DB183: start_thread (pthread_create.c:312)
==18183==    by 0x55EEFFC: clone (clone.S:111)
==18183==  Address 0x6f2cbdc is 156 bytes inside a block of size 192 free'd
==18183==    at 0x4C2C2BC: operator delete(void*) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==18183==    by 0x4E6C86E: zmq::pipe_t::process_pipe_term_ack() (pipe.cpp:377)
==18183==    by 0x4E7BE3B: zmq::socket_base_t::process_commands(int, bool) (socket_base.cpp:1367)
==18183==    by 0x4E7C93E: zmq::socket_base_t::recv(zmq::msg_t*, int) (socket_base.cpp:1240)
==18183==    by 0x4E991C8: s_recvmsg(zmq::socket_base_t*, zmq_msg_t*, int) (zmq.cpp:497)
==18183==    by 0x4E9924C: zmq_recv (zmq.cpp:524)
==18183==    by 0x400D91: receiving_thread (in [...]/zmq_memleak/test)
==18183==    by 0x52DB183: start_thread (pthread_create.c:312)
==18183==    by 0x55EEFFC: clone (clone.S:111)
==18183== 
==18183== Invalid read of size 4
==18183==    at 0x4E6D250: zmq::pipe_t::terminate(bool) (pipe.cpp:396)
==18183==    by 0x4E817C7: zmq::socket_base_t::term_endpoint(char const*) (socket_base.cpp:1086)
==18183==    by 0x400ED6: receiving_thread (in [...]/zmq_memleak/test)
==18183==    by 0x52DB183: start_thread (pthread_create.c:312)
==18183==    by 0x55EEFFC: clone (clone.S:111)
==18183==  Address 0x6f2cbd8 is 152 bytes inside a block of size 192 free'd
==18183==    at 0x4C2C2BC: operator delete(void*) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==18183==    by 0x4E6C86E: zmq::pipe_t::process_pipe_term_ack() (pipe.cpp:377)
==18183==    by 0x4E7BE3B: zmq::socket_base_t::process_commands(int, bool) (socket_base.cpp:1367)
==18183==    by 0x4E7C93E: zmq::socket_base_t::recv(zmq::msg_t*, int) (socket_base.cpp:1240)
==18183==    by 0x4E991C8: s_recvmsg(zmq::socket_base_t*, zmq_msg_t*, int) (zmq.cpp:497)
==18183==    by 0x4E9924C: zmq_recv (zmq.cpp:524)
==18183==    by 0x400D91: receiving_thread (in [...]/zmq_memleak/test)
==18183==    by 0x52DB183: start_thread (pthread_create.c:312)
==18183==    by 0x55EEFFC: clone (clone.S:111)
==18183== 
==18183== 
==18183== HEAP SUMMARY:
==18183==     in use at exit: 472 bytes in 12 blocks
==18183==   total heap usage: 275 allocs, 263 frees, 327,666 bytes allocated
==18183== 
==18183== Thread 1:
==18183== 472 (56 direct, 416 indirect) bytes in 1 blocks are definitely lost in loss record 11 of 11
==18183==    at 0x4C2B3B0: operator new(unsigned long, std::nothrow_t const&) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==18183==    by 0x4E8A760: zmq::stream_engine_t::mechanism_ready() (stream_engine.cpp:866)
==18183==    by 0x4E8A947: zmq::stream_engine_t::next_handshake_command(zmq::msg_t*) (stream_engine.cpp:771)
==18183==    by 0x4E860A7: zmq::stream_engine_t::out_event() (stream_engine.cpp:382)
==18183==    by 0x4E5C729: zmq::epoll_t::loop() (epoll.cpp:184)
==18183==    by 0x4E8FE64: thread_routine (thread.cpp:106)
==18183==    by 0x52DB183: start_thread (pthread_create.c:312)
==18183==    by 0x55EEFFC: clone (clone.S:111)
==18183== 
==18183== LEAK SUMMARY:
==18183==    definitely lost: 56 bytes in 1 blocks
==18183==    indirectly lost: 416 bytes in 11 blocks
==18183==      possibly lost: 0 bytes in 0 blocks
==18183==    still reachable: 0 bytes in 0 blocks
==18183==         suppressed: 0 bytes in 0 blocks
==18183== 
==18183== For counts of detected and suppressed errors, rerun with: -v
==18183== ERROR SUMMARY: 3 errors from 3 contexts (suppressed: 0 from 0)

Regards,
Matt

@bluca
Member

bluca commented Jun 21, 2017

Your program is both connecting and binding the same socket:

  int const connect_result = zmq_connect(socket, d->address);
  assert(-1 != connect_result);
  int const bind_result = zmq_bind(socket, d->address);
  assert(-1 != bind_result);

That's undefined behaviour - can you try to repro without that please?

@Matthew-Jemielity
Author

Yeah, that was a copy-paste error from sending_thread() that I overlooked, sorry.
With that zmq_connect() removed from receiving_thread() I no longer get the invalid read/write, but I still get the memleak. This explains why I only saw the memleak in the actual project I constructed the minimal case from.

Corrected minimal case:
zmq_recv_req_rep.txt

valgrind output:

==3772== Memcheck, a memory error detector
==3772== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
==3772== Using Valgrind-3.10.1 and LibVEX; rerun with -h for copyright info
==3772== Command: ./test
==3772== 
==3772== 
==3772== HEAP SUMMARY:
==3772==     in use at exit: 73,176 bytes in 13 blocks
==3772==   total heap usage: 200 allocs, 187 frees, 262,316 bytes allocated
==3772== 
==3772== 472 (56 direct, 416 indirect) bytes in 1 blocks are definitely lost in loss record 11 of 12
==3772==    at 0x4C2B3B0: operator new(unsigned long, std::nothrow_t const&) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==3772==    by 0x4E8A380: zmq::stream_engine_t::mechanism_ready() (stream_engine.cpp:866)
==3772==    by 0x4E8A68F: zmq::stream_engine_t::process_handshake_command(zmq::msg_t*) (stream_engine.cpp:799)
==3772==    by 0x4E878DE: zmq::stream_engine_t::in_event() (stream_engine.cpp:344)
==3772==    by 0x4E5C47D: zmq::epoll_t::loop() (epoll.cpp:188)
==3772==    by 0x4E8FA84: thread_routine (thread.cpp:106)
==3772==    by 0x50C5183: start_thread (pthread_create.c:312)
==3772==    by 0x53D8BEC: clone (clone.S:111)
==3772== 
==3772== LEAK SUMMARY:
==3772==    definitely lost: 56 bytes in 1 blocks
==3772==    indirectly lost: 416 bytes in 11 blocks
==3772==      possibly lost: 0 bytes in 0 blocks
==3772==    still reachable: 72,704 bytes in 1 blocks
==3772==         suppressed: 0 bytes in 0 blocks
==3772== Reachable blocks (those to which a pointer was found) are not shown.
==3772== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==3772== 
==3772== For counts of detected and suppressed errors, rerun with: -v
==3772== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)

@Matthew-Jemielity Matthew-Jemielity changed the title intermittent memory leak + invalid read/write for req/rep send/recv. intermittent memory leak for req/rep send/recv. Jun 21, 2017
@bluca
Member

bluca commented Jun 21, 2017

I can reproduce it, but only about 1 out of 50 times; it will be a bit of a pain to narrow down (PRs welcome!)

@Matthew-Jemielity
Author

Matthew-Jemielity commented Jun 22, 2017

If I move the zmq context operations to main() and pass ctx to the threads (zmq_recv_req_rep_v2.txt), valgrind no longer reports any leaks (or at least in fewer than 1 in 100 runs). Looks like zmq_ctx_{new,destroy}() shouldn't be used simultaneously in multiple threads.

@bluca
Member

bluca commented Jun 22, 2017

Using a single context for the lifetime of the application is recommended practice, yes.

But it should still work fine - perhaps it's a linger issue. Could you please try setting ZMQ_LINGER to -1 on both sockets and try again?

@Matthew-Jemielity
Author

Matthew-Jemielity commented Jun 22, 2017

With linger set (zmq_recv_req_rep_v3.txt) the leak is much harder to reproduce. I couldn't find it on 64-bit Ubuntu, but did find it on 32-bit, with the same backtrace.

@bluca
Member

bluca commented Jun 22, 2017

It's metadata; I wonder if it's related to #2567.
Are you sure it's running with the latest version from git?

@Matthew-Jemielity
Author

Matthew-Jemielity commented Jun 22, 2017

$ git show --oneline HEAD
f5da4b1 Merge pull request #2603 from bluca/xpub_manual_subs
$ gcc -xc <(echo -e "#include <stdio.h>\n#include <zmq.h>\nint main(void) { int major, minor, patch; zmq_version(&major, &minor, &patch); printf(\"%d.%d.%d\\\\n\", major, minor, patch); return 0; }\n\n") -lzmq && ./a.out && rm a.out
4.2.3
$ gcc -Wall -Wextra -std=c99 -pedantic zmq_recv_req_rep_v3.c -lzmq -pthread -o test
$ valgrind --leak-check=full ./test
==11231== Memcheck, a memory error detector
==11231== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
==11231== Using Valgrind-3.10.1 and LibVEX; rerun with -h for copyright info
==11231== Command: ./test
==11231== 
==11231== 
==11231== HEAP SUMMARY:
==11231==     in use at exit: 264 bytes in 12 blocks
==11231==   total heap usage: 199 allocs, 187 frees, 163,148 bytes allocated
==11231== 
==11231== 264 (28 direct, 236 indirect) bytes in 1 blocks are definitely lost in loss record 11 of 11
==11231==    at 0x402A9AC: operator new(unsigned int, std::nothrow_t const&) (in /usr/lib/valgrind/vgpreload_memcheck-x86-linux.so)
==11231==    by 0x409FD0D: zmq::stream_engine_t::mechanism_ready() (stream_engine.cpp:866)
==11231==    by 0x40A008F: zmq::stream_engine_t::process_handshake_command(zmq::msg_t*) (stream_engine.cpp:799)
==11231==    by 0x409CC83: zmq::stream_engine_t::in_event() (stream_engine.cpp:344)
==11231==    by 0x406B978: zmq::epoll_t::loop() (epoll.cpp:188)
==11231==    by 0x40A5D8F: thread_routine (thread.cpp:106)
==11231==    by 0x40DCF71: start_thread (pthread_create.c:312)
==11231==    by 0x41E03ED: clone (clone.S:129)
==11231== 
==11231== LEAK SUMMARY:
==11231==    definitely lost: 28 bytes in 1 blocks
==11231==    indirectly lost: 236 bytes in 11 blocks
==11231==      possibly lost: 0 bytes in 0 blocks
==11231==    still reachable: 0 bytes in 0 blocks
==11231==         suppressed: 0 bytes in 0 blocks
==11231== 
==11231== For counts of detected and suppressed errors, rerun with: -v
==11231== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)

@bluca
Member

bluca commented Jun 23, 2017

Ok, thanks.

To me this smells like a possible race condition at context shutdown - which means if you use the same context for the whole lifetime of your application it should not have an impact.

I'll try and reproduce more and investigate, help is of course welcome.

@bjovke
Contributor

bjovke commented Jun 26, 2017

I've also managed to reproduce it.
I've modified the code a little by using C++11 and compiled it with g++ 6 (test.txt).
It's now also compilable and runnable on Windows.
I've added a sync so the sender thread launches only when the receiver is ready to accept connections.

This is what I got on the second run (Ubuntu 14, same command line):

g++ -std=c++11 -Wall -Wextra test.cpp -pedantic -pthread -o test -L /usr/local/lib -l zmq
valgrind --leak-check=full ./test

==24612== Memcheck, a memory error detector
==24612== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
==24612== Using Valgrind-3.10.1 and LibVEX; rerun with -h for copyright info
==24612== Command: ./test
==24612==
==24612==
==24612== HEAP SUMMARY:
==24612==     in use at exit: 73,176 bytes in 13 blocks
==24612==   total heap usage: 204 allocs, 191 frees, 262,348 bytes allocated
==24612==
==24612== 472 (56 direct, 416 indirect) bytes in 1 blocks are definitely lost in loss record 11 of 12
==24612==    at 0x4C2B3B0: operator new(unsigned long, std::nothrow_t const&) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==24612==    by 0x4E8BC00: zmq::stream_engine_t::mechanism_ready() (stream_engine.cpp:866)
==24612==    by 0x4E8BF17: zmq::stream_engine_t::process_handshake_command(zmq::msg_t*) (stream_engine.cpp:799)
==24612==    by 0x4E88F0C: zmq::stream_engine_t::in_event() (stream_engine.cpp:344)
==24612==    by 0x4E5B91D: zmq::epoll_t::loop() (epoll.cpp:188)
==24612==    by 0x4E91B98: thread_routine (thread.cpp:106)
==24612==    by 0x55F1183: start_thread (pthread_create.c:312)
==24612==    by 0x590137C: clone (clone.S:111)
==24612==
==24612== LEAK SUMMARY:
==24612==    definitely lost: 56 bytes in 1 blocks
==24612==    indirectly lost: 416 bytes in 11 blocks
==24612==      possibly lost: 0 bytes in 0 blocks
==24612==    still reachable: 72,704 bytes in 1 blocks
==24612==         suppressed: 0 bytes in 0 blocks
==24612== Reachable blocks (those to which a pointer was found) are not shown.
==24612== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==24612==
==24612== For counts of detected and suppressed errors, rerun with: -v
==24612== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)

@bjovke
Contributor

bjovke commented Jun 26, 2017

The metadata member of stream_engine_t is allocated with new in stream_engine.cpp, line 866.
stream_engine_t's destructor has this:

    //  Drop reference to metadata and destroy it if we are
    //  the only user.
    if (metadata != NULL) {
        if (metadata->drop_ref ()) {
            LIBZMQ_DELETE(metadata);
        }
    }

It seems that the reference count of metadata is still greater than 0 after drop_ref(), so metadata is not freed, and there is no later opportunity to free it.

@bjovke
Contributor

bjovke commented Jun 26, 2017

One drop_ref () is definitely missing sometimes, and then metadata never gets freed.
I've put printf() s into the add_ref () and drop_ref () functions; here's the result.
There are five "Ref dropped" lines when it's OK and four when it's not:

valgrind --leak-check=full ./test

==26092== Memcheck, a memory error detector
==26092== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
==26092== Using Valgrind-3.10.1 and LibVEX; rerun with -h for copyright info
==26092== Command: ./test
==26092==
Ref dropped, this: 0x0x611c7e0
Ref added, this: 0x0x611d300
Ref added, this: 0x0x611d300
Ref added, this: 0x0x611d300
Ref dropped, this: 0x0x611d300
Ref dropped, this: 0x0x611d300
Ref dropped, this: 0x0x611d300
Ref dropped, this: 0x0x611d300
==26092==
==26092== HEAP SUMMARY:
==26092==     in use at exit: 72,704 bytes in 1 blocks
==26092==   total heap usage: 207 allocs, 206 frees, 262,412 bytes allocated
==26092==
==26092== LEAK SUMMARY:
==26092==    definitely lost: 0 bytes in 0 blocks
==26092==    indirectly lost: 0 bytes in 0 blocks
==26092==      possibly lost: 0 bytes in 0 blocks
==26092==    still reachable: 72,704 bytes in 1 blocks
==26092==         suppressed: 0 bytes in 0 blocks
==26092== Reachable blocks (those to which a pointer was found) are not shown.
==26092== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==26092==
==26092== For counts of detected and suppressed errors, rerun with: -v
==26092== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)


valgrind --leak-check=full ./test

==26099== Memcheck, a memory error detector
==26099== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
==26099== Using Valgrind-3.10.1 and LibVEX; rerun with -h for copyright info
==26099== Command: ./test
==26099==
Ref added, this: 0x0x611d2c0
Ref added, this: 0x0x611d2c0
Ref dropped, this: 0x0x611c7f0
Ref added, this: 0x0x611d2c0
Ref dropped, this: 0x0x611d2c0
Ref dropped, this: 0x0x611d2c0
Ref dropped, this: 0x0x611d2c0
==26099==
==26099== HEAP SUMMARY:
==26099==     in use at exit: 73,176 bytes in 13 blocks
==26099==   total heap usage: 204 allocs, 191 frees, 262,348 bytes allocated
==26099==
==26099== 472 (56 direct, 416 indirect) bytes in 1 blocks are definitely lost in loss record 11 of 12
==26099==    at 0x4C2B3B0: operator new(unsigned long, std::nothrow_t const&) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==26099==    by 0x4E8BC80: zmq::stream_engine_t::mechanism_ready() (stream_engine.cpp:866)
==26099==    by 0x4E8BF97: zmq::stream_engine_t::process_handshake_command(zmq::msg_t*) (stream_engine.cpp:799)
==26099==    by 0x4E88F8C: zmq::stream_engine_t::in_event() (stream_engine.cpp:344)
==26099==    by 0x4E5B96D: zmq::epoll_t::loop() (epoll.cpp:188)
==26099==    by 0x4E91C18: thread_routine (thread.cpp:106)
==26099==    by 0x55F1183: start_thread (pthread_create.c:312)
==26099==    by 0x590137C: clone (clone.S:111)
==26099==
==26099== LEAK SUMMARY:
==26099==    definitely lost: 56 bytes in 1 blocks
==26099==    indirectly lost: 416 bytes in 11 blocks
==26099==      possibly lost: 0 bytes in 0 blocks
==26099==    still reachable: 72,704 bytes in 1 blocks
==26099==         suppressed: 0 bytes in 0 blocks
==26099== Reachable blocks (those to which a pointer was found) are not shown.
==26099== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==26099==
==26099== For counts of detected and suppressed errors, rerun with: -v
==26099== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)

@bjovke bjovke closed this as completed Jun 26, 2017
@bjovke bjovke reopened this Jun 26, 2017
@bjovke
Contributor

bjovke commented Jun 26, 2017

Sorry for close/reopen, wrong button...

@bluca
Member

bluca commented Jun 26, 2017

If you also add a call to zmq::print_backtrace (from err.hpp) you'll see where the calls are made (install libunwind-dev and rebuild).

@bjovke
Contributor

bjovke commented Jun 27, 2017

@bluca Great! I'll try that; then we'll find the missing one.
I've looked at the code a little, and this metadata member is a perfect candidate for a shared pointer.
I'm thinking about how it could be transformed into exactly that, with a simple implementation inside libzmq.
But there are heavy dependencies on it between various modules, and a lot of changes would be needed in that case.
Maybe it's simpler to just fix the missing drop_ref (), if that's possible at all?

@bluca
Member

bluca commented Jun 27, 2017

I would recommend a small and simple fix. Doing sweeping refactoring across the core library is tricky and risky.

@bjovke
Contributor

bjovke commented Jun 27, 2017

This is the missing drop_ref (); when it's present, there's no leak.
It seems that the reaper thread does not receive the reap command, or it terminates prematurely.

#0  0x7ffb02de77f0 in /usr/local/lib/libzmq.so.5 (zmq::metadata_t::drop_ref()+0x1f)
#1  0x7ffb02de84cf in /usr/local/lib/libzmq.so.5 (zmq::msg_t::close()+0x32)
#2  0x7ffb02defe02 in /usr/local/lib/libzmq.so.5 (zmq::pipe_t::rollback() [clone .part.3]+0x38)
#3  0x7ffb02df07a8 in /usr/local/lib/libzmq.so.5 (zmq::pipe_t::terminate(bool)+0xa6)
#4  0x7ffb02e062f6 in /usr/local/lib/libzmq.so.5 (zmq::socket_base_t::process_term(int)+0x3f)
#5  0x7ffb02e0636f in /usr/local/lib/libzmq.so.5 (zmq::socket_base_t::start_reaping(zmq::epoll_t*)+0xd0)
#6  0x7ffb02df7f70 in /usr/local/lib/libzmq.so.5 (zmq::reaper_t::process_reap(zmq::socket_base_t*)+0x13)
#7  0x7ffb02df8163 in /usr/local/lib/libzmq.so.5 (zmq::reaper_t::in_event()+0xbc)
#8  0x7ffb02dded0c in /usr/local/lib/libzmq.so.5 (zmq::epoll_t::loop()+0xde)
#9  0x7ffb02e153de in /usr/local/lib/libzmq.so.5 (thread_routine+0x39)
#10  0x7ffb0267a0f9 in /lib/x86_64-linux-gnu/libpthread.so.0 (start_thread+0xc4)
#11  0x7ffb023a740d in /lib/x86_64-linux-gnu/libc.so.6 (clone+0x6d)

Edit: actually the reaper always receives the reap command, but frame 2 of this stack trace is never reached when there's a leak.

@bluca
Member

bluca commented Jun 27, 2017

So it is a race at shutdown. Does it still happen with linger at -1?

@bjovke
Contributor

bjovke commented Jun 27, 2017

@bluca Yes, it still happens. This is the test I used. It happened in 4 out of 15 runs.
Interestingly, I haven't managed to reproduce this on Windows, no matter how hard I try.
But there's a difference: on Windows I see that select_t is used, while on Ubuntu it's epoll_t.

@bjovke
Contributor

bjovke commented Jun 27, 2017

Windows stack trace:

test_1.exe!zmq::metadata_t::drop_ref() Line 55	C++
test_1.exe!zmq::msg_t::close() Line 261	C++
test_1.exe!zmq::pipe_t::rollback() Line 242	C++
test_1.exe!zmq::pipe_t::terminate(bool delay_) Line 443	C++
test_1.exe!zmq::socket_base_t::process_term(int linger_) Line 1410	C++
test_1.exe!zmq::own_t::terminate() Line 145	C++
test_1.exe!zmq::socket_base_t::start_reaping(zmq::select_t * poller_) Line 1325	C++
test_1.exe!zmq::reaper_t::process_reap(zmq::socket_base_t * socket_) Line 128	C++
test_1.exe!zmq::object_t::process_command(zmq::command_t & cmd_) Line 138	C++
test_1.exe!zmq::reaper_t::in_event() Line 98	C++
test_1.exe!zmq::select_t::loop() Line 317	C++
test_1.exe!zmq::select_t::worker_routine(void * arg_) Line 393	C++
test_1.exe!thread_routine(void * arg_) Line 46	C++
test_1.exe!invoke_thread_procedure(unsigned int(__stdcall*)(void *) procedure, void * const context) Line 92	C++
test_1.exe!thread_start<unsigned int (__stdcall*)(void *)>(void * const parameter) Line 115	C++
[External Code]	

@bjovke
Contributor

bjovke commented Jun 27, 2017

Found it!
In zmq::pipe_t::terminate () there's this code:

    else if (state == waiting_for_delimiter && !delay) {
        outpipe = NULL;
        send_pipe_term_ack (peer);
        state = term_ack_sent;
    }

When the leak happens, this condition is true and outpipe becomes NULL.
After that, there's this code in the same function:

    if (outpipe) {

        //  Drop any unfinished outbound messages.
        rollback ();

        //  Write the delimiter into the pipe. Note that watermarks are not
        //  checked; thus the delimiter can be written even when the pipe is full.
        msg_t msg;
        msg.init_delimiter ();
        outpipe->write (msg, false);
        flush ();
    }

rollback (), which eventually calls drop_ref () and frees the metadata, is never executed when outpipe is NULL, so the metadata is leaked (for one message, in this case).

I've made a PR for this.

bluca added a commit that referenced this issue Jun 27, 2017
Problem: intermittent memory leak for req/rep send/recv. #2602 Solution: memory leak fixed.
@bluca
Member

bluca commented Jun 27, 2017

Great stuff, thanks!

@Matthew-Jemielity could you please pull and try again?

Fixed by #2609

@bluca bluca closed this as completed Jun 27, 2017
@Matthew-Jemielity
Author

I confirm my tests no longer leak with the patch. Thank you!
