Segmentation fault while running the kernel's netfilter concurrency tests #1750
Per the gdb output (see below), it may be that the problem happens when iperf3 completes the tests and the server is killed (SIGTERM) in line 1340 of the shell script ( In any case, it is not clear whether the problem is in iperf3 or somehow in the openEuler implementation of Other information that can help is showing, via gdb, the register contents when the error happened and the assembly code around the instruction where it happened. That may help in understanding what is causing the SIGSEGV and give a hint about the root cause. Evaluation of the gdb output (which is very helpful!) shows that the SIGSEGV happened in
Indeed, iperf3 should have received the SIGTERM signal sent by the shell script before receiving SIGSEGV. This forces one to consider thread/process synchronization when the signal arrives. The script uses the -D option when starting iperf3 in server mode, as can be seen at lines 634 and 734. After removing this option, the problem did not reproduce in thousands of tests. This again points the problem toward multi-process signal handling.
openEuler can be roughly understood as a clone of CentOS 7. There is no unique implementation of setjmp()/longjmp(). Moreover, the problem can be reproduced on openSUSE Leap 15.6 (I'll add this information to the initial post), using the same iperf3 code and script. Unfortunately, manually starting
gdb shows two threads, as shown in post 0, and the call trace that can be parsed (thread 2 in gdb, Thread 0x7f38bd3f6740) shows the working flow of iperf3's signal-processing function when the SIGTERM signal is caught. However, it is thread 1 that generates the SIGSEGV, and its call stack seems to be smashed (I have used Let me read the corresponding code first; thank you for your hint anyway.
Closed by mistake; reopening it.
I have also reproduced this issue, and I found the first commit that causes the coredump. Context:
Steps to Reproduce
Bug Report
The kernel self-test script has completed running and prints ok.
When executing the concurrent test phase, a segmentation fault occurred in iperf3. It produces many coredump files. core-iperf3-3353376-11:
Maybe this is the reason?
Since the call traces in the core files (my test and ikernel-mryao's) are incomplete or corrupted, I have to infer from the code logic that the race condition is caused by multiple threads calling the signal-processing function at the same time. With PR #1752, only the main thread handles signals, while the thread-synchronization facilities in the existing code ensure that child threads do not race with the main thread for resources. Please let me know if you have any different opinions :)
I didn't realize that exit() may have undefined behavior when we call it from a signal handler. See man 3 exit: exit() is an MT-Unsafe interface. The call flow is as below:
In the presence of multiple threads, it may be necessary to handle synchronization and exit more gracefully.
Good observation about If the
Oops, my bad. I didn't observe that the client needs to use a signal handler to control the exit logic. But no abnormal exit can be reproduced after PR #1752. This at least shows that there is a race condition in the multi-threaded response to the signal.
Unfortunately, I tried to use a mutex to protect exit(), but it still didn't solve the issue. Please forgive my unfamiliarity with the code. So far, I personally have no good way to ensure that only the main thread calls exit().
@ikernel-mryao, thanks for the detailed information! It shows that the issue in this case happened in Based on the inputs and discussion so far, I think the problem may be that the Server is killed before it completed the test cleanup. That is, the Client completed, but the Server is still in the process of terminating the worker threads. This is why there are still two server threads, as after the test cleanup only the main thread should be running. @xTire / @ikernel-mryao, to test the above conjecture, can you retry running the test script (I don't have openEuler installed), but with adding If this solves the problem, the next step will be to find a way to change the code so that the signal handler will not try to do problematic actions if its state is not
Sure, this should modify the test script to:
right? IIUC, after adding Here is the test script I'm currently using. You can try it in your local test environment. Thanks to @ikernel-mryao for simplifying the test suite to shorten the test time.
Hi @davidBar-On,
I think the reason why using It would be great if gdb could parse all the call traces, but in fact…, this is the obstacle to locating the problem. If you can reproduce this problem locally and collect enough core files, you will always have a chance to find that a thread calls the alias
I tested this version, and it seems to work o.k.😄 For both the server and the client (using Did you try running the tests using this version to see if it solves the SIGSEGV problem?
Yes, this is what I meant (although the
As the test is with only one stream, in principle the number of cores should not have an impact. Therefore, I think the problem might be in the operating system or in the system/network functionality (see below).
Unfortunately, I am not able to reproduce the problem. However, from this and the other gdb outputs I suspect that the root cause is related to the operating system's multi-core processing. When everything works o.k., the server cancels all the "worker" threads when the client ends execution (maybe with some small delay, which is why I asked for the Since both the main and the "worker" thread are still active when the server is being killed (even after the 5-second sleep), I can think of two options for how this can happen:
In both cases, it may be that iperf3 receives a return code from
Thank you for your time.
Yes, I ran the test script through another 1000 loops, and no new segfault.
Yes, multi-core processors do not impact the kill cycle itself, but they do impact the scheduling of multiple threads, especially when multiple threads enter the signal-processing function at the same time in response to the signal. This process is asynchronous, so a mutex or some other synchronization is needed.
If the number of processors is sufficient, I recommend using bare metal to reproduce the issue. The probability of reproducing it in a virtual-machine environment is very low. In another test environment of mine, a virtual machine was assigned 48 processor cores, and only 5 segmentation-fault records were obtained after running the test script 200 times. Let's go back to
Before the kill command in the script is executed, filter with the 'iperf3' keyword in htop to obtain the following. The entries in white in the "Command" column are processes, and the entries in green are threads. Similar results were obtained with
Unfortunately, turning on debug output in my tests made the problem unreproducible. Remember what I mentioned before: if you remove the In case the problem cannot be reproduced in your environment, I have attached some core files I collected. The core files in the attachment were generated by iperf3 compiled from commit 67ca2c8, and the attachment also contains the test script. Please let me know if I missed anything.
I believe that a mutex around these lines should be sufficient: Lines 137 to 139 in 7679199
In addition, iperf_timestrerr should be changed to a local variable in iperf_err() and iperf_errexit(), instead of being global. (These changes can be done in a separate PR, but I think it is better to add them to #1752.)
Great, thanks! I believe that in this case PR #1752 is good enough to resolve the problem, even if the root cause will not be fully understood.
I fully agree. The release of the resources by a "worker" thread (which your PR #1752 fixes) caused the fault in the main thread. What I still don't understand is why the "worker" thread is still active at that point.
Is this taken after the
Thanks a lot! I will try to evaluate these core dumps later to see if they can provide any further insight into the problem.
Sorry for posting the modification directly here, as it does not fix the SIGSEGV issue. For your double check:
After the modification, a SIGSEGV occurred around the 3rd script loop:
To my uneducated eye, maybe a separate PR is needed. Based on the previous discussion, it should be very helpful to reproduce the problem locally.
Please ignore the
test output:
@xTire, thanks for continuing to put effort for testing this issue.
Maybe you are right that a separate PR is required, but if possible, I suggest another try. It may be that the problem with your solution is that
I don't know whether this is needed ... For now, I believe it is o.k. to keep it.
The above change shows the status of the main Server threads (the

# shellcheck disable=SC2046,SC2086 # word splitting wanted here
echo Waiting for Clients to terminate: ${pids}
ps -afT | grep iperf3
wait $(for pid in ${pids}; do echo ${pid}; done)
echo Ended waiting for Clients to terminate
ps -afT | grep iperf3
for pid in ${pids}; do
echo ${pid}
cat /proc/${pid}/cmdline
done
# shellcheck disable=SC2046,SC2086
# test added(https://github.com/esnet/iperf/issues/1750#issuecomment-2317382909)
sleep 5
echo Killing server: ${flood_pids}
ps -afT | grep iperf3
for pid in ${flood_pids}; do
echo ${pid} before kill
kill ${pid} 2>/dev/null
echo ${pid} after kill
cat /proc/${pid}/cmdline
done

(Note that I will probably not be available for about the next two weeks, although I hope there will be others who can help, especially if they have the environment for reproducing the issue.)
I just tried it and it still doesn't work. As you mentioned, the mutex belonging to In my opinion, the following three conditions need to be met at the same time to trigger this problem:
As @ikernel-mryao mentioned,
I guess this commit is the first time the So far, PR #1752 is a lower-cost solution unless it introduces a new regression.
Unfortunately, I may soon lose access to the physical-machine environment, as the machine will be needed for other purposes. I really hope someone can reproduce this problem; so far it appears possible as long as you have enough processor cores.
With your tests and analysis I now fully agree that PR #1752 should remain as is and that the
From the output I see that I completely misunderstood the script and that (As I wrote before, I will probably not be able to respond further for about the next two weeks, but if you have more info about the processes before/after the kill, it would be helpful to show it. Thanks.)
Let me briefly introduce the logic of the nft_concat_range.sh script. First of all, note that the key function is test_concurrency.
Context
Version of iperf3: 3.17.1+ (cJSON 1.7.15)
Hardware: x86_64 & aarch64
Operating system (and distribution, if any): openEuler & openSUSE Leap 15.6
Other relevant information:
Bug Report
Expected Behavior
The kernel self-test script has completed running
Actual Behavior
When executing the concurrent test phase, a segmentation fault occurred in iperf3
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/plain/tools/testing/selftests/netfilter/nft_concat_range.sh?h=v5.10.224
for i in $(seq 5); do ./nft_concat_range.sh; done