FTL v54b4ad93 crashes sometimes (several times today) for unknown reasons #2112
Could you please run the debugging procedure? https://deploy-preview-338--pihole-docs.netlify.app/ftldns/debugging/ has all the details, including a special section on how to do it inside a container.
I can see the same happening both for nightly and development tags. This is what is shown after the mentioned debugging procedure inside the docker container:
It doesn't look very helpful.
Seems you need to add
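A minimal sketch of what this typically involves, assuming the elided setting is the SYS_PTRACE capability and that gdb and pidof are available inside the image:

```sh
# Start the container with permission to trace processes (gdb needs ptrace).
docker run -d --cap-add=SYS_PTRACE --name pihole pihole/pihole:nightly

# Open a shell inside and attach gdb to the running FTL process.
docker exec -it pihole bash
gdb -p "$(pidof pihole-FTL)"
```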
I have another with
Okay, so
This issue has been mentioned on Pi-hole Userspace. There might be relevant details there: https://discourse.pi-hole.net/t/pi-hole-stopped-responding/73931/4
Thanks for your patience. However, when also adding
This means there was no crash. Why FTL exited with exit code 01 will be found in
This is what the log contains:
Note that this happens on the current development image as well as on yesterday's and today's nightly images. The log above is from the latest nightly because, while it crashes, FTL recovers quicker, whereas development does not and sometimes takes down the container.
The first line, e.g., says the crash happened in a process with PID 3270, which is a fork of process 52. Unfortunately,
The first crash you have reported here had the crash in
According to the link above, Linux supports setting
allowing
edit Clarified above and suggested a different command to run
might work, too, but I am not sure the system is still healthy while the main process is suspended; it may have other consequences. Easiest would probably be waiting until we get a crash that is not in a fork, so
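For context, these are the standard gdb knobs involved here (a sketch; the exact commands suggested above are elided):

```sh
# Follow the child on fork(): gdb detaches from the parent, which keeps
# running -- this is what lets us catch crashes inside forks.
# Alternatively, 'set detach-on-fork off' keeps debugging both, but the
# process not currently being debugged stays suspended -- the "healthy
# system" concern mentioned above.
gdb -p "$(pidof pihole-FTL)" \
    -ex 'set follow-fork-mode child' \
    -ex continue
```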
With
With
Maybe it's a good idea to create a development build that includes debug symbols just to have a valid stack trace output?
Here is "my" crash:
We already have the full set of debug symbols in the release builds, otherwise, you would not have seen the function names and related code lines in:
But this also shows that following forks/children won't work for us here. The location where you entered
@schuettecarsten Could you attach the debugger as well? This latest crash was in a fork, too; however, the first one (the very first post in this issue ticket) was in a normal thread of the main process.
Maybe here:
or
or
I'm not experienced with gdb. Thank you for your patience. If these backtraces are not helpful, I apologize.
Signal 17 is SIGCHLD.
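If gdb keeps stopping on it, the usual remedy is standard gdb signal handling (a sketch, not the exact elided command):

```sh
# Let SIGCHLD pass through to FTL (which uses it to reap finished children)
# without stopping or printing in the debugger.
gdb -p "$(pidof pihole-FTL)" \
    -ex 'handle SIGCHLD nostop noprint pass' \
    -ex continue
```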
@bermeitinger-b I don't think there is anything wrong with the network table TBH, my current assumption is that somehow handling of the strings got broken and your database got populated with garbage data. This would likely explain the crash you are seeing, too. But I am still not sure how/where it happens. I have meanwhile manually inspected the entire string-processing code paths twice and did not find anything odd. Please also enable
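The exact option requested above is elided; as an illustration, FTL v6 exposes granular debug logging through its config CLI, roughly along these lines (the option names here are assumptions; verify with pihole-FTL --help):

```sh
# Illustrative only: enable shared-memory and database debug logging.
pihole-FTL --config debug.shmem true
pihole-FTL --config debug.database true
```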
Here is an excerpt after flushing the network table:
Yes, this brings us a lot closer. Please restart
you can try to search for the first occurrence of this strange string like
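The suggested query itself is elided; a sketch of what such a search could look like against FTL's long-term database (table and column names are assumptions — check the schema with .schema first):

```sh
# Look for host names containing the garbage marker seen in the logs;
# replace the LIKE pattern with the strange string in question.
sqlite3 /etc/pihole/pihole-FTL.db \
  "SELECT network_id, ip, name FROM network_addresses WHERE name LIKE '%E%';"
```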
It'd also be helpful if you could run
and see if any errors regarding hash tables are reported in
Thanks. I've restarted and it prints a lot of
Typically, this is an indicator that it's not a literal E. After many of these lines, it looks like the output changes:
About the hash collisions:
Seems okay; however, I'm not sure why it scans 100 clients. There should only be 3 (localhost, DoH, DoT).
Looking at the logs, we are in a bit of a chicken-and-egg problem, as the database already has some broken stuff which then "contaminates" FTL's strings during the history import on startup. I'd suggest disabling database importing for the moment,
and then restart again. You may also want to clean out the log file first to get rid of the binary stuff that is in there right now.
edit The 100 clients come about because FTL detects clients based on their (string) IP address, and when the IP address is garbage, a new client is added for each new garbage string.
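A sketch of how the import can be switched off, assuming the FTL v6 option is database.DBimport (verify the key against your config):

```sh
# Illustrative: stop FTL from re-importing the last day of queries from the
# (possibly contaminated) database on startup, then restart FTL.
pihole-FTL --config database.DBimport false
```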
I've deleted the old database and started fresh. It generated the following log:
The rest is the same as in the first logs above. As above, it did not restart automatically. The log file is not binary anymore, so the network table looks correct.
Did you have the debugger attached?
Maybe:
and right after:
Then, the whole container dies. Edit: After some time, the glyphs reappear in the network table. I don't have a backtrace for those but will check the log. |
I fully understand and I'm thankful for your deep investigation. I can reproduce the crash by deleting
Sorry for the delay, but I had a hard time thinking about what we could do. I have now prepared a new special branch
Okay, thank you very much. How can I run this inside docker? |
@bermeitinger-b Unfortunately, for docker it means a few extra steps. You'll have to build your own container like
and then use the generated image
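Roughly, the steps look like this (a sketch: the build.sh flags for selecting an FTL branch and tagging the image are assumptions — check the script's usage in docker-pi-hole):

```sh
# Build a local Pi-hole image against the special FTL debug branch.
git clone https://github.com/pi-hole/docker-pi-hole.git
cd docker-pi-hole
./build.sh -f <ftl-debug-branch> -t pihole:local  # flags are assumptions

# Then point the compose file at the freshly built image:
#   image: pihole:local
```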
Thanks, this worked perfectly and I'm running the debugging fork now. The log confirms this:
In gdb, it shows many lines with
This is a SIGSEGV that I've observed (right after I restarted the upstream resolver; might be unrelated):
Okay, so this confirms a crash again deep inside
We have prepared instructions for running FTL under valgrind.
Still sorry it is such a slow process...
I don't think you're in a position to apologize ;) Thank you for your efforts. I've started valgrind. Within docker, that's a bit complicated because stopping pihole-FTL stops the container. I've adjusted the pihole service:

```yaml
pihole:
    image: pihole:local
    ...
    entrypoint: /bin/bash
    tty: true
```

Then, I could start valgrind. The error log is here:
It indicates something at
pihole-FTL also dies with many
I hope this helps.
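For reference, a sketch of the kind of invocation this enables (standard valgrind flags; the elided instructions may differ, and the log path is an assumption):

```sh
# With the entrypoint replaced by bash, FTL can be run in the foreground
# under valgrind without the container dying when FTL stops.
valgrind --leak-check=full --track-origins=yes \
         --log-file=/var/log/pihole/valgrind.log \
         pihole-FTL -f
```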
No, this is actually fine. There will be a lot of output, the log (or
I hope this leads somewhere. I can't really reproduce the state. This time, it crashes after 58 minutes and it's detached with error 33. I've removed the nostops for 33 (and 37).
And here the valgrind.log: https://0x0.st/Xk7C.log There is one binary line in line 55 which I've removed. (If you need it, it's here https://0x0.st/Xk7F.bin) |
Thanks. The important bit is
I did just push an update to the same debug branch (commit 26d1dd3), please re-build the container and 🤞 |
Thanks, the new commit resulted in probably the same message:
Full log: https://0x0.st/XdHX.log
This time, I couldn't
I'm sure it's the correct version. From FTL.log:
Too bad, so back to the drawing board. I did just push another update to the branch adding further debug output. It'd be good if you could update and try again. Once it crashes, we'd need the address the algorithm tried to access, like what you quoted from
Please ensure you still have
If you can attach both
edit The expected commit would be b4f194b |
This time, gdb was spammed with "Thread 4 event 33" and the CPU was so busy with gdb and valgrind that I had to restart the server. So, I could not run these gdb commands. These are the last 7 minutes in FTL.log: https://0x0.st/Xdbb.txt And this is the valgrind.log: https://0x0.st/XdbZ.log
Thanks. Unfortunately, this is another crash in another location without the "Invalid read of size 4" section, so we don't get the address it tried to look at. I pushed another commit (59c22db) that adds some extra out-of-bounds checking. Let's hope I'm not going down the wrong track here...
edit 1 It'll be interesting to see:
edit 2 I added another error message, updated the expected commit hash above |
Thanks. I think it didn't crash where it could be traced. I've removed the
(Then, it's dead.) valgrind.log: https://0x0.st/Xdbd.log These are the last lines in FTL.log: https://0x0.st/Xdbn.txt
Yeah, it shouldn't stop at SIG33. When
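i.e., something along these lines (standard gdb syntax; SIG37 included since it was mentioned earlier):

```sh
# Real-time signals such as SIG33/SIG37 are used internally; let them pass
# through without stopping or spamming the debugger.
gdb -p "$(pidof pihole-FTL)" \
    -ex 'handle SIG33 nostop noprint pass' \
    -ex 'handle SIG37 nostop noprint pass' \
    -ex continue
```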
Okay, I think I have something new. With red, very dry eyes, I have been staring at your previous log and found the following interesting lines:
Okay ... so it tried to access memory at
I am actually quite confident that adding the missing remapping instructions (commit 341464f) fixes the bug we are seeing here. It also explains why it mostly happens in forks and why I haven't seen it myself before (my network is simply too quiet/not enough queries).
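To make the failure mode concrete, here is a minimal self-contained C sketch of this bug class (not FTL's actual code): a process keeps using its old mmap() after the shared object has been enlarged, and any access past the stale mapping is exactly the kind of invalid read valgrind reported.

```c
// Sketch of the stale-mapping bug class: one process enlarges a POSIX
// shared-memory object, and another process -- e.g. a fork handling a TCP
// query -- must re-mmap() before touching data beyond its old mapping,
// or it hits unmapped pages (SIGSEGV / "invalid read").
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define SHM_NAME "/demo_shm"

struct shm_view {
    void  *base;   // where the object is mapped in *this* process
    size_t size;   // size of *our* mapping; may lag behind the real size
};

// Re-map if the object grew since we last mapped it -- the "remapping
// instructions" that were missing in the crashing code path.
static int shm_sync(struct shm_view *v, int fd, size_t real_size)
{
    if (v->size == real_size)
        return 0;                       // mapping is up to date
    if (v->base != NULL)
        munmap(v->base, v->size);       // drop the stale mapping
    v->base = mmap(NULL, real_size, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    if (v->base == MAP_FAILED)
        return -1;
    v->size = real_size;
    return 0;
}

int main(void)
{
    int fd = shm_open(SHM_NAME, O_CREAT | O_RDWR, 0600);
    if (fd < 0 || ftruncate(fd, 4096) < 0)
        return 1;

    struct shm_view v = { 0 };
    shm_sync(&v, fd, 4096);
    strcpy((char *)v.base, "hello");

    // Another process (or thread) enlarges the object ...
    ftruncate(fd, 8192);

    // ... so before dereferencing anything past the old size we MUST call
    // shm_sync() again; touching offset 5000 through the stale 4096-byte
    // mapping would be exactly the out-of-bounds access seen here.
    shm_sync(&v, fd, 8192);
    ((char *)v.base)[5000] = 'x';       // safe only after remapping

    munmap(v.base, v.size);
    close(fd);
    shm_unlink(SHM_NAME);
    return 0;
}
```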
I take my hat off to you while bowing with respect. You did it. 👍 With this branch, it has been running for 10 hours without a crash. Thank you very much.
I just looked at your code. Is this code thread-safe? Or is thread safety no issue because of a global "lock" somewhere? |
Thank you for confirming the fix! I have now split this into two branches: the one you are currently on (
I did that to ease review, as these are basically two independent changes which should result in two independent pull requests (#2120 and #2121). I also removed a few of the commits adding more debug output which we didn't need in the end. No need to keep "dead" code around.
We take care of only accessing these global memory sections when a global (shared-memory) lock is held (
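As an illustration of that pattern (a sketch, not FTL's actual lock code): a pthread mutex living inside the shared region, initialized as process-shared so every thread, fork, and process mapping the memory contends on the same lock.

```c
// Sketch of a process-shared lock guarding a shared-memory section
// (illustrative; FTL's real lock lives in its own shmem object).
#include <pthread.h>

struct shared_section {
    pthread_mutex_t lock;   // must be initialized as PROCESS_SHARED
    int counter;            // example payload guarded by the lock
};

// One-time initialization, done by the first process before any forks.
static int shared_lock_init(struct shared_section *s)
{
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    // Allow the mutex to be used across all processes mapping this memory.
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    int ret = pthread_mutex_init(&s->lock, &attr);
    pthread_mutexattr_destroy(&attr);
    return ret;
}

// Every access to the shared section is bracketed by the global lock,
// which makes the string/hash-table code effectively serialized even
// though many threads and forks map the same memory.
static void shared_increment(struct shared_section *s)
{
    pthread_mutex_lock(&s->lock);
    s->counter++;
    pthread_mutex_unlock(&s->lock);
}

int main(void)
{
    // For brevity this instance lives on the stack; in real use it would
    // be mmap()ed MAP_SHARED so other processes see the same lock.
    struct shared_section s = { .counter = 0 };
    if (shared_lock_init(&s) != 0)
        return 1;
    shared_increment(&s);
    return 0;
}
```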
The bug should be fixed in
FTL crashed for unknown reasons several times today.
Here is a full log: