Intermittent error in startup sequence: s6-svwait: fatal: unable to s6_svstatus_read: No such file or directory
#460
Comments
Ah, that's annoying. Can you locally patch s6-overlay to change that sleep line to a longer duration and see whether it helps? I suppose in the future I will have to make that sleep duration configurable, so that people can make it as long as their environment needs.
@skarnet thanks for the quick reply. I can definitely try that, although it will take some time to confirm whether it helps or not. So, just to be clear: you're saying that using the new s6-rc layer would not lead to this issue?
@nocive Yes, if you convert your services to s6-rc, the race condition goes away. Solving this was one of the first things that went into the s6-rc spec. ^^'
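For readers unfamiliar with the format, here is a rough sketch of what converting one service to s6-rc under s6-overlay v3 might look like; the service name "myapp" and its run command are made up, and the s6-overlay v3 README remains the authoritative reference for the layout.

```bash
#!/bin/sh
# Rough sketch only: define a hypothetical "myapp" longrun for s6-overlay v3.
svc=/etc/s6-overlay/s6-rc.d/myapp
mkdir -p "$svc"

# Service type: "longrun" for a supervised daemon, "oneshot" for init work.
echo longrun > "$svc/type"

# The run script that s6-supervise starts (and restarts) for this service.
cat > "$svc/run" <<'EOF'
#!/bin/sh
exec nginx -g "daemon off;"
EOF
chmod +x "$svc/run"

# Register the service in the default "user" bundle so it starts at boot.
mkdir -p /etc/s6-overlay/s6-rc.d/user/contents.d
touch /etc/s6-overlay/s6-rc.d/user/contents.d/myapp
```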
Ok, thanks! I think I'll try that instead, otherwise I might be banished to the shadow realm by my peers if I propose something that monkey patches an s6 file and changes an arbitrary sleep. 😅 In all seriousness, it's probably for the best to switch to the new layer anyway; it was just not at the top of the priority list :) Thanks for all your amazing work! <3
BTW, refactoring cont-init, services and cont-finish to s6-rc isn't feasible for Home Assistant's case. In the new s6-rc format, every service must declare a dependency on its oneshots. But Home Assistant ships a set of base images which come with a couple of cont-init.d scripts of their own. Converting them to s6-rc would require everyone who adds their own services on top of the base images to be aware of every oneshot and explicitly mark their services as dependents on them. This is undesirable, since we want those cont-init scripts to always run. In case you're curious, this was the reason why we didn't convert our legacy scripts to s6-rc while upgrading to v3. That said, here are my 2 cents:
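To make the dependency concern above concrete: in the v3 source format, marking a service as dependent on a oneshot is one empty file per oneshot, per service, roughly as sketched below (all names here are hypothetical).

```bash
# Hypothetical names: "my-addon" must wait for the base image's "base-init"
# oneshot, so it needs an explicit dependency entry of its own.
mkdir -p /etc/s6-overlay/s6-rc.d/my-addon/dependencies.d
touch /etc/s6-overlay/s6-rc.d/my-addon/dependencies.d/base-init
```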
Thank you, as always, for s6-overlay!
Oh wow! Thanks a lot!
I've added an S6_SERVICES_READYTIME environment variable to control that sleep. @nocive I certainly don't want to deter you from converting your services to s6-rc :-) but if you can build s6-overlay from source, could you tell me if using that variable (with a greater value than the default) makes the problem go away for you?
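For reference, once an s6-overlay build that includes this variable is in the image, setting it is just a matter of passing an ordinary environment variable; a minimal sketch, with a placeholder image name and the value in milliseconds as discussed later in the thread:

```bash
# Placeholder image name; the variable is the extra time, in milliseconds,
# given to the legacy services.d supervisors before readiness is checked.
docker run --rm -d \
  -e S6_SERVICES_READYTIME=50 \
  my-s6-overlay-image
```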
Oops, sorry, I missed all the activity going on here :-) Unfortunately I immediately jumped on the s6-rc bandwagon, and everything has been converted to the new format and has been working like a charm ever since.
It's fine, converting to s6-rc is a good move no matter what. 😄 I'm closing this then.
@skarnet in the environment where this issue occurs for me, it's very rare, so it's difficult to test. I'm trying to find a way to replicate the issue so that I can test whether increasing S6_SERVICES_READYTIME helps. I have tried several things, without success (i.e. I'm not able to replicate the issue reliably). Do you have any idea how this race condition can be exercised?
By any chance, do you have any hint about what made your environment reproduce the issue?
@felipecrs The only way to exercise the race condition is to be unlucky with the scheduler's decisions. You can probably increase the frequency by overloading the machine, i.e. making it difficult for a given process to get a CPU timeslice. The race condition comes from the fact that the legacy services.d supervisors are started asynchronously, and the startup sequence only gives them a short, fixed sleep before it starts waiting on them; if a supervisor hasn't had time to write its status file by then, s6-svwait fails with the error you're seeing. With the s6-rc approach, all the supervisors are guaranteed to be ready before anything waits on them, so the race cannot happen. With the legacy approach, the only protection is that sleep. The optimal sleep duration depends on your machine's load, really. If you expect CPU time to be at a premium, increase the duration by a significant amount.
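A minimal sketch of the "overload the machine" idea, assuming the stress utility is installed and using a placeholder image name; the loop in the later comment is a more systematic version of the same experiment:

```bash
# Keep every CPU busy for two minutes so container startup has to fight for
# timeslices, then start the container and look for the race in its logs.
stress --cpu "$(nproc)" --timeout 120 &

docker run --rm -d --name race-test my-s6-overlay-image
sleep 2
docker logs race-test 2>&1 | grep "unable to s6_svstatus_read" && echo "race hit"
```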
@skarnet thanks a lot for the very detailed explanation. I'll conduct some tests based on this, and will share the result as well.
Ok, this is really complicated. I tried to stress my CPU using stress with s-tui, but it does not seem to make any difference, I think. Sorry, I had to upload the videos somewhere else because they are too big for GitHub.

S6_SERVICES_READYTIME=1: multiple failures after 1 minute of video. I was able to reproduce the failure several times offline as well. https://1drv.ms/v/s!AntpCaY1YjnMi6ItGFtMq9wgcVb4Nw?e=RoLwsh

S6_SERVICES_READYTIME=50: no failures after almost 5 minutes of testing. I wasn't able to catch a failure at all using this value. https://1drv.ms/v/s!AntpCaY1YjnMi6ItGFtMq9wgcVb4Nw?e=ZdP36s

Conclusion: I know the testing environment is far from ideal, but I think this is already a very good signal.
I wrote a simple script to test it for me:

```bash
#!/bin/bash

# For a given S6_SERVICES_READYTIME value, start the container repeatedly and
# count how many runs hit the s6-svwait race in their startup logs.
function test() {
  rm -f "$1"-log.txt
  local attempts=500
  local failures=0
  for i in $(seq 1 $attempts); do
    clear
    echo "S6_SERVICES_READYTIME=${1} - Attempt $i of $attempts"
    set -x
    docker rm -f test
    docker run --name test --env=S6_SERVICES_READYTIME="$1" --rm -d ghcr.io/home-assistant/generic-x86-64-homeassistant:2022.11.4
    sleep 1.25s
    set +x
    if docker logs test 2>&1 |
      tee /dev/stderr |
      grep -q "s6-svwait: fatal: unable to s6_svstatus_read: No such file or directory"; then
      echo "attempt ${i} at $(date)" >> "$1"-log.txt
      failures=$((failures + 1))
      # Bell
      echo -e '\a'
    fi
  done
  echo "Failed ${failures} out of ${attempts} attempts for S6_SERVICES_READYTIME=${1}" >> result.txt
}

test 0
test 5
test 50
```

Here's the result:
Failures for
Failures for
I think this is conclusive: I can safely say that increasing S6_SERVICES_READYTIME fixes the issue.
So apparently 5 ms is doing nothing, which in hindsight isn't that surprising because it's a very small delay. 50 ms feels like a bit much, but is probably still in the unnoticeable range, so if it works for you, it should be safe for everyone. I increased the default sleep time to 50 ms in the latest git. Thanks for testing!
Wow, thanks so much! This will bring me a lot of peace of mind.
Since we upgraded to 3.x, we started observing some seemingly random errors during the startup sequence of some of our applications. The error symptoms are as follows:
This is a container that runs a legacy cont-init layer and two services via the legacy services.d layer (nginx + php-fpm).
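For readers who have not used the legacy layout, the setup being described looks roughly like the sketch below; the script contents are purely illustrative and not the reporter's actual files.

```bash
# Legacy s6-overlay layout, roughly as described above (illustrative contents).
mkdir -p /etc/cont-init.d /etc/services.d/nginx /etc/services.d/php-fpm

# cont-init.d: oneshot scripts, run in order before the services are started.
cat > /etc/cont-init.d/10-prepare.sh <<'EOF'
#!/bin/sh
echo "one-time initialisation"
EOF
chmod +x /etc/cont-init.d/10-prepare.sh

# services.d: one directory per supervised service, each with a "run" script.
cat > /etc/services.d/nginx/run <<'EOF'
#!/bin/sh
exec nginx -g "daemon off;"
EOF
chmod +x /etc/services.d/nginx/run

cat > /etc/services.d/php-fpm/run <<'EOF'
#!/bin/sh
exec php-fpm --nodaemonize
EOF
chmod +x /etc/services.d/php-fpm/run
```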
Other s6 environment details:
Unfortunately, I don't have an isolated reproduction scenario that I can provide; the issue happens intermittently for us and I haven't found a way to reproduce it consistently. We can only observe it during our CI pipelines, despite using the exact same Docker containers that we use everywhere else.
I'm mostly posting this looking for hints or clues as to what might be going on.