
Building a static version on RHEL 6.x systems #15

Open
mgimelfarb opened this issue Feb 17, 2016 · 6 comments

@mgimelfarb

While there is a great writeup on building gcc 4.8.2 and then building bruce with it, I was wondering if there was a way to build bruce statically. If so, what would be needed to get that accomplished in RHEL 6.x and derivatives (OEL 6.6 in my case).

I've got a use case of running bruce on a set of locked-down machines with little remote access due to strict compliance issues, and I'm looking for easier deployment options.

@dspeterson
Contributor

To build a statically linked bruce, you can tweak the SConstruct file as shown by the diff below. When I do this, I see the warnings shown below. I'm not sure what would be a good workaround for the getpwnam() and getaddrinfo() issues. Maybe an easier option would be to set up an RHEL 6.x development box and tweak the third line of the file centos6/gcc482.spec:

    %global _prefix /opt/gcc

to specify an alternate installation location for gcc 4.8.2. Assuming that you have write access to some directory, say /home/mgimelfarb, on the locked-down machines, you might try cloning bruce's git repo into /home/mgimelfarb/bruce on your development box and editing the above line so gcc 4.8.2 is installed in /home/mgimelfarb/bruce/opt/gcc. After building gcc 4.8.2 and bruce, you could then make a tarball of the entire repo directory, transfer it to the locked-down boxes, and try running it (remember to set the environment variables PATH=/home/mgimelfarb/bruce/opt/gcc/bin:$PATH and LD_LIBRARY_PATH=/home/mgimelfarb/bruce/opt/gcc/lib64).

out/release/base/dynamic_lib.o: In function `Base::TDynamicLib::TDynamicLib(char const*, int)':
dynamic_lib.cc:(.text+0x15e): warning: Using 'dlopen' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
out/release/third_party/mongoose/mongoose.o: In function `mg_start':
mongoose.c:(.text+0x6c20): warning: Using 'getpwnam' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
out/release/socket/db/cursor.o: In function `Socket::Db::TCursor::TCursor(char const*, char const*, int, int, int, int)':
cursor.cc:(.text+0x43): warning: Using 'getaddrinfo' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
/usr/lib/../lib64/libpthread.a(libpthread.o): In function `sem_open':
(.text+0x77ad): warning: the use of `mktemp' is dangerous, better use `mkstemp'
out/release/third_party/mongoose/mongoose.o: In function `handle_proxy_request':
mongoose.c:(.text+0x32dc): warning: Using 'gethostbyname' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking



diff --git a/SConstruct b/SConstruct
index 3c305fc..7f8e19b 100644
--- a/SConstruct
+++ b/SConstruct
@@ -177,7 +177,7 @@ def set_release_options():
     # 14.  :-(
     env.AppendUnique(CPPFLAGS=['-U_FORTIFY_SOURCE'])

-    env.AppendUnique(LINKFLAGS=['-rdynamic'])
+    env.AppendUnique(LINKFLAGS=['-rdynamic', '-static'])


 # Append 'mode' specific environment variables.
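
As a rough sketch of the alternate-prefix workaround described above (the clone path and exact steps are examples based on this discussion, not tested instructions), the sequence on the development box might look like this:

    # On the RHEL 6.x development box; paths are examples.
    # 1. Clone the bruce repo into the directory you'll have on the locked-down boxes:
    git clone <bruce repo URL> /home/mgimelfarb/bruce

    # 2. Edit centos6/gcc482.spec so its third line reads:
    #        %global _prefix /home/mgimelfarb/bruce/opt/gcc
    #    then build gcc 4.8.2 and bruce per the existing write-up.

    # 3. Package the whole tree, including the bundled gcc runtime:
    cd /home/mgimelfarb && tar czf bruce-bundle.tar.gz bruce

    # 4. On a locked-down machine, extract the tarball to /home/mgimelfarb and set:
    export PATH=/home/mgimelfarb/bruce/opt/gcc/bin:$PATH
    export LD_LIBRARY_PATH=/home/mgimelfarb/bruce/opt/gcc/lib64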

@mgimelfarb
Author

@dspeterson, thank you for the instructions. I actually tried doing a build_all with the following command line:
./build_all -m release -p
and it looks like the client libs failed to build:

Building client_libs:

------------------------------------------------------------
Building target bruce/client/libbruce_client.a:

scons: Entering directory '/some_path/sources/bruce/bruce/bruce'
------------------------------------------------------------
------------------------------------------------------------
Building target bruce/client/libbruce_client.so:

scons: Entering directory '/some_path/sources/bruce/bruce/bruce'
/usr/bin/ld: /opt/gcc/lib/gcc/x86_64-unknown-linux-gnu/4.8.2/crtbeginT.o: relocation R_X86_64_32 against '__TMC_END__' can not be used when making a shared object; recompile with -fPIC
/opt/gcc/lib/gcc/x86_64-unknown-linux-gnu/4.8.2/crtbeginT.o: could not read symbols: Bad value
collect2: error: ld returned 1 exit status
scons: *** [out/release/bruce/client/libbruce_client.so] Error 1
Stopping on failure

I'm not sure if that's something that could be easily corrected, but I wanted to mention it.
It does appear that to_bruce got built, though.

Otherwise, I've been running some tests and so far, so good, but I did get a segfault on bruce.test:

 ./bruce.test
[==========] Running 7 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 7 tests from TBruceTest
[ RUN      ] TBruceTest.SuccessfulDeliveryTest
[       OK ] TBruceTest.SuccessfulDeliveryTest (208 ms)
[ RUN      ] TBruceTest.KeyValueTest
[       OK ] TBruceTest.KeyValueTest (5215 ms)
[ RUN      ] TBruceTest.AckErrorTest
This part of the test is expected to take a while ...
Segmentation fault

I haven't looked into it yet.
Next goal is to set up a mock kafka server and run some load through bruce. Outside of what's already provided here, is there a good way to do a load test with the mock kafka server and the bruce client?

@dspeterson
Contributor

Regarding the error building bruce/client/libbruce_client.so, it's probably because you're building a shared library while specifying -static. Since you're statically linking, I would tweak the build so that libbruce_client.so isn't built. Regarding the segfault, does it only occur when running unit tests on a statically linked bruce, or does it also occur when building in the usual manner without static linking?

@mgimelfarb
Author

@dspeterson, while I haven't done what you asked just yet, I did do a static debug build:

./build_all -m debug -p
Interestingly enough, the same test didn't segfault.

./bruce.test
[==========] Running 7 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 7 tests from TBruceTest
[ RUN      ] TBruceTest.SuccessfulDeliveryTest
[       OK ] TBruceTest.SuccessfulDeliveryTest (240 ms)
[ RUN      ] TBruceTest.KeyValueTest
[       OK ] TBruceTest.KeyValueTest (237 ms)
[ RUN      ] TBruceTest.AckErrorTest
This part of the test is expected to take a while ...
[       OK ] TBruceTest.AckErrorTest (10251 ms)
[ RUN      ] TBruceTest.DisconnectTest
[       OK ] TBruceTest.DisconnectTest (5244 ms)
[ RUN      ] TBruceTest.MalformedMsgTest
[       OK ] TBruceTest.MalformedMsgTest (1239 ms)
[ RUN      ] TBruceTest.UnsupportedVersionMsgTest
[       OK ] TBruceTest.UnsupportedVersionMsgTest (1238 ms)
[ RUN      ] TBruceTest.CompressionTest
[       OK ] TBruceTest.CompressionTest (239 ms)
[----------] 7 tests from TBruceTest (18690 ms total)

[----------] Global test environment tear-down
[==========] 7 tests from 1 test case ran. (18691 ms total)
[  PASSED  ] 7 tests.

BTW, there were no issues with the shared library this time, which is somewhat inexplicable.

@dspeterson
Contributor

I believe the segfault and shared library build failure aren't reproducing for you in a debug build because the debug version isn't actually getting linked statically. The diff I posted above showing how to tweak the SConstruct file for static linking only enables it for release builds. Sorry about that - it was an oversight on my part. To also enable static linking for debug builds, you need to make the same change to LINKFLAGS inside set_debug_options() as the diff does inside set_release_options().
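
For reference, the corresponding change inside set_debug_options() mirrors the hunk above; the exact existing line there may differ from what I show here, but the idea is to add -static to LINKFLAGS:

    -    env.AppendUnique(LINKFLAGS=['-rdynamic'])
    +    env.AppendUnique(LINKFLAGS=['-rdynamic', '-static'])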

When I do a debug build with static linking in my CentOS 6.7 development VM, I see the shared library build failure and also intermittent segfaults, which only occur when static linking is used. Looking at a core file I see that bruce is crashing inside a call to getaddrinfo() as shown in the stack trace below. I believe this is almost certainly related to the following warning emitted during the build:

out/release/socket/db/cursor.o: In function `Socket::Db::TCursor::TCursor(char const*, char const*, int, int, int, int)':
cursor.cc:(.text+0x43): warning: Using 'getaddrinfo' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking

Given this issue with getaddrinfo() and static linking, I think your best bet is to link dynamically, and try the workaround I suggest above which involves editing centos6/gcc482.spec so you end up with a big tarball containing both bruce and gcc 4.8.2. Admittedly, it's not a very pretty solution but I think it will work.

About load testing with bruce and the mock kafka server, take a look at the command line args for the to_bruce command. There are options for sending a large number of messages in rapid succession, with an adjustable delay between messages. I recommend running the mock kafka server with several topics specified in its config, and setting up bruce's config so the topics have different batching and compression configs. Then run several instances of to_bruce concurrently so that bruce gets a mixture of messages with different topics. I also recommend doing similar load testing using real kafka brokers rather than mock kafka. Although the ideal thing is to set up a real cluster of broker machines, you can get by with only one machine if you are in a hurry or don't have access to an actual cluster. I often do this by running several kafka brokers in my development VM, each with its own config file, edited so that the brokers listen on different ports. If you need help setting this up, let me know and I can provide assistance. Good luck :-)
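
For example, a small shell loop can launch several senders concurrently; the option names below are placeholders rather than to_bruce's actual flags, so check to_bruce's help output for the real names:

    # Placeholder flags -- substitute to_bruce's actual option names.
    for topic in topic_a topic_b topic_c; do
        ./to_bruce --socket-path /path/to/bruce.socket --topic "$topic" \
                --count 100000 --delay-ms 0 &
    done
    wait  # let all of the concurrent senders finish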

(gdb) bt
#0  0x0000003073070cda in fgets_unlocked () from /lib64/libc.so.6
#1  0x00007f30532b96cf in _nss_files_gethostbyname2_r ()
   from /lib64/libnss_files.so.2
#2  0x00000000006d050b in gethostbyname2_r ()
#3  0x00000000006c39de in gaih_inet ()
#4  0x00000000006c5b30 in getaddrinfo ()
#5  0x000000000058b35f in Socket::Db::TCursor::TCursor(char const*, char const*, int, int, int, int) () at src/socket/db/cursor.cc:38
#6  0x000000000056c0f1 in Bruce::Util::ConnectToHost(char const*, unsigned short, Base::TFd&) () at src/bruce/util/connect_to_host.cc:40
#7  0x00000000005317a5 in Bruce::Util::ConnectToHost(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned short, Base::TFd&)
    () at src/bruce/util/connect_to_host.h:42
#8  0x000000000052e53c in Bruce::MsgDispatch::TConnector::DoConnect() ()
    at src/bruce/msg_dispatch/connector.cc:267
#9  0x000000000052e76b in Bruce::MsgDispatch::TConnector::ConnectToBroker() ()
    at src/bruce/msg_dispatch/connector.cc:295
#10 0x00000000005304ba in Bruce::MsgDispatch::TConnector::DoRun() ()
    at src/bruce/msg_dispatch/connector.cc:821
#11 0x000000000052e299 in Bruce::MsgDispatch::TConnector::Run() ()
    at src/bruce/msg_dispatch/connector.cc:232
#12 0x00000000005c041b in Thread::TFdManagedThread::RunAndTerminate() ()
    at src/thread/fd_managed_thread.cc:135
#13 0x00000000005c2201 in _ZNKSt7_Mem_fnIMN6Thread16TFdManagedThreadEFvvEEclIJEvEEvPS1_DpOT_ () at /opt/gcc/include/c++/4.8.2/functional:601
#14 0x00000000005c2160 in _ZNSt5_BindIFSt7_Mem_fnIMN6Thread16TFdManagedThreadEFvvEEPS2_EE6__callIvJEJLm0EEEET_OSt5tupleIJDpT0_EESt12_Index_tupleIJXspT1_EEE ()
    at /opt/gcc/include/c++/4.8.2/functional:1296
#15 0x00000000005c20d2 in _ZNSt5_BindIFSt7_Mem_fnIMN6Thread16TFdManagedThreadEFvvEEPS2_EEclIJEvEET0_DpOT_ () at /opt/gcc/include/c++/4.8.2/functional:1355
#16 0x00000000005c2064 in _ZNSt12_Bind_simpleIFSt5_BindIFSt7_Mem_fnIMN6Thread16TFdManagedThreadEFvvEEPS3_EEvEE9_M_invokeIJEEEvSt12_Index_tupleIJXspT_EEE ()
    at /opt/gcc/include/c++/4.8.2/functional:1732
#17 0x00000000005c1fb1 in std::_Bind_simple<std::_Bind<std::_Mem_fn<void (Thread::TFdManagedThread::*)()> ()(Thread::TFdManagedThread*)> ()()>::operator()() ()
    at /opt/gcc/include/c++/4.8.2/functional:1720
#18 0x00000000005c1f4a in std::thread::_Impl<std::_Bind_simple<std::_Bind<std::_Mem_fn<void (Thread::TFdManagedThread::*)()> ()(Thread::TFdManagedThread*)> ()()> >::_M_run() () at /opt/gcc/include/c++/4.8.2/thread:115
#19 0x0000000000642e10 in execute_native_thread_routine ()
#20 0x00000000005efae4 in start_thread ()
#21 0x00000000006ca719 in clone ()
(gdb) 

@dspeterson
Contributor

When load testing with real kafka brokers, a good stress test is to shut down a broker while sending messages through bruce. Then wait a while, start the broker again, and run the kafka-preferred-replica-election.sh tool that comes with kafka to reassign partition leadership. In addition to doing a clean shutdown, try an unclean shutdown (i.e. kill -9) to simulate a broker crash. While performing these steps, you can watch bruce's log messages to see how it handles the failures. Also try shutting down more than one broker. You can try this with different workloads, including multiple client machines sending messages through bruce, to get an idea of the cluster's performance characteristics and ability to tolerate and recover from failures. While doing this, try running some consumers and check the message flow to verify that there is no message loss.
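
As a rough sketch of that sequence (the kafka install path, config file name, and zookeeper address are examples from a typical single-machine test setup, not anything specific to your environment):

    # Simulate a broker crash while traffic is flowing through bruce:
    kill -9 <broker pid>        # unclean shutdown of one broker

    # ... wait a while, watching bruce's log output ...

    # Bring the broker back up:
    /opt/kafka/bin/kafka-server-start.sh -daemon /opt/kafka/config/server-1.properties

    # Reassign partition leadership to the preferred replicas:
    /opt/kafka/bin/kafka-preferred-replica-election.sh --zookeeper localhost:2181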

When one or more brokers are down, use bruce's web interface to see if message backlogs develop, and if so how quickly. If messages start getting backlogged, see how quickly the backlogs disappear after bringing broker(s) back up. This will help you decide whether your cluster is adequately provisioned to tolerate failures, how large a value to specify for bruce's --msg_buffer_max option, and whether enough message batching is being performed.

Bruce's batching configuration has a large influence on kafka cluster performance, and inadequate batching even for a single relatively low-volume topic can quickly degrade performance. The sensitivity of kafka cluster performance to reduced batching is greater than one might expect just from considering physical factors such as network latency and disk performance. When batching was greatly decreased or eliminated, we discovered by running the brokers under strace that they spent a good deal of time repeatedly adding and removing file descriptors to/from an epoll set, which is a slow operation. Modifying kafka's implementation to avoid this behavior would likely be difficult since it is written in Scala, and therefore relies on Scala APIs for simultaneously monitoring multiple TCP connections, rather than directly making epoll-related system calls as can be done in C or C++ code. Kafka's design relies heavily on batching to achieve good performance, so this sort of problem, which occurs only when batching is nearly or completely disabled, shouldn't be viewed as a major issue, but it's good to be aware of. If there are topics for which very low latency is desired, you may have to make a tradeoff between low latency and broker cluster performance.
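
If you want to see that epoll behavior for yourself, one quick way (the pid is whatever broker process you pick) is to attach strace to a broker under load and count the epoll-related system calls:

    # Count epoll_ctl/epoll_wait calls made by a broker for 30 seconds;
    # -c prints a per-syscall summary when strace detaches.
    timeout 30 strace -f -c -e trace=epoll_ctl,epoll_wait -p <broker pid>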

At if(we), we replaced a legacy data pipeline with a new system consisting of bruce, kafka, and consumers (both developed internally at if(we) and open-source consumers developed elsewhere). Before deploying to production, we did extensive testing of this sort in our QA environment to gain confidence in the integrity of the new data pipeline. We then ran both the old and new data pipelines in production for a while to facilitate a gradual switchover and provide a fallback option in case unexpected problems occurred.
