in parallel mode, no output for hanging test on timeout-induced interrupt #206
Sorry for the delay - what signal is being sent here? It should handle SIGINT correctly and do its best to emit logs. To confirm: are these being run in parallel?
Yes, the tests are being run in parallel. A recent example from this morning is http://pr-test.k8s.io/20810/kubernetes-pull-build-test-e2e-gce/27402/build-log.txt. (All times in this log are UTC, I think.) We have the timeout set to 1hr, so at around 16:31, Jenkins sent an interrupt to abort, though there's oddly nothing in the console output from Ginkgo after 15:53 or so. Ginkgo printed a bit more, shortly after Jenkins uploaded its build log (asynchronous processes are great):
This appears to just be the output from the
Another mysterious timeout, of the form in my first comment:
(Full log here) The lack of context is making it hard to tell where we were stuck.
I'm trying to understand what might be different between running ginkgo locally, where we observe ginkgo responds correctly to SIGINT, and running it under Jenkins. I haven't used Jenkins in a while, so I'd just like to confirm that the signal Jenkins is sending to ginkgo is in fact SIGINT.
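For what it's worth, one way to confirm which signal Jenkins actually delivers is to run a small probe in the same job and abort the build. This is a hypothetical sketch (not something from this thread; the job wiring is assumed):

```go
// Hypothetical probe: run this as the build step, abort the Jenkins build,
// and see which signal(s) arrive. Purely illustrative; not part of Ginkgo
// or the Kubernetes test setup.
package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
)

func main() {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGINT, syscall.SIGTERM, syscall.SIGHUP, syscall.SIGQUIT)
	fmt.Println("pid", os.Getpid(), "waiting to be aborted...")
	fmt.Println("received", <-sigs)
}
```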
I think it sends an interrupt. One guess I have is that Jenkins is sending the signal to every process in the tree, not just the top-level ginkgo process. |
Haven't looked at the code, but from https://wiki.jenkins-ci.org/display/JENKINS/Aborting+a+build:
And just to confirm, you are running the latest version of the ginkgo binary? It would be interesting if you're able to follow up on that line of investigation. Ginkgo does seem to spawn multiple os-level processes (at least on OSX; I assume the behavior is the same on linux) and it seems feasible that Jenkins would signal all the processes in the tree.
I replied before I saw your comment above. It should be fairly easy to reproduce the described Jenkins behavior locally: create a bunch of tests that run long enough for you to find and kill all the PIDs, and observe whether ginkgo hangs.
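As a rough illustration of that idea (assuming Ginkgo v1's dot-import style; the package and suite names here are made up), a suite like this sleeps long enough to hunt down the worker PIDs by hand and signal them all, mimicking Jenkins' process-tree termination:

```go
// Hypothetical reproduction suite: each parallel node just sleeps, leaving
// plenty of time to find the ginkgo worker PIDs and signal every process in
// the tree, then watch whether the runner hangs without output.
package hang_test

import (
	"testing"
	"time"

	. "github.com/onsi/ginkgo"
	. "github.com/onsi/gomega"
)

func TestHang(t *testing.T) {
	RegisterFailHandler(Fail)
	RunSpecs(t, "Hang Suite")
}

var _ = Describe("a deliberately slow spec", func() {
	It("sleeps so the process tree can be signalled by hand", func() {
		time.Sleep(10 * time.Minute)
		Expect(true).To(BeTrue())
	})
})
```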
We're currently using Ginkgo. I can try simulating a "kill-all-Ginkgo-processes" setup to see if I can get it to reproduce locally. Looking around at Jenkins stuff, I found a long-open Jenkins issue regarding graceful process termination (which is somewhat outdated - Jenkins doesn't use it anymore).
@ixdy Did you end up exploring that option or finding a resolution?
I don't think we ever found a proper solution. I think it started happening less, possibly because we mitigated the root cause (bad tests?) in other ways.
@williammartin @robdimsdale @ixdy let me try and refresh this bug a little.
and further shutdown messages. If you want to take a look, the logs are public: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-correctness/62?log#log (see the point where the timestamps switch from ~10pm to ~7am). Note in particular that ginkgo hangs before we send the interrupt. It is not the tests that are slow or stuck - at the point of the hang there are over 300 tests to go, and with a parallelism of 30 nodes there should be progress even if one - or a handful of - tests were stuck. So I don't really understand Rob's request in #206 (comment).
Thanks for the update. It's a bit hard to parse the output, but I see what you mean: there should be at least some progress across 30 nodes in the 9-hour gap. I'll have a think, but nothing springs to mind immediately around how to debug this without some changes or manual intervention. FWIW, in Garden we have a tool called
How frequently does this happen @porridge? How reproducible is this? A long time ago (years) I invested effort to try to reproduce this but could not. I've seen it happen in the wild though and would love a reproducible case. My gut says the issue is somehow related to how the parallel nodes capture output - that some sort of non-obvious lower-level bug (possibly outside of ginkgo... go? kernel?) gets triggered. I'd love to validate/invalidate this hypothesis. If you are seeing this with some frequency and my guess is correct, then running ginkgo with the -stream flag should resolve the issue at the cost of giving you interleaved test output. I'm not saying -stream is the long-term fix - but I'd like to get some real data around whether or not it solves the problem. If it does, then we can rewrite how log capture works and move it from the clients (the running nodes) to the server.
@onsi it has happened 2 times so far in 75 attempts, so a ~2.5% rate. We haven't tried reproducing it, since a successful run takes ~9 hours. BTW, when counting these hangs, I noticed today that in both cases the message about the interrupt and starting AfterSuite is repeated after a millisecond or four:
Is this expected?
@williammartin thanks for the suggestion. What would we need to look at if we caught it red-handed, though? Just send SIGABRT and take stack traces? @ixdy that script would not be terribly useful because it's timeout-based, but if we could periodically measure the size of build-log.txt, then an hour of no change in size would be a good indication that this is happening, and we could abort the run, forcing stack trace dumps. Do you think this would be feasible?
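A rough sketch of that watchdog idea (the log path, polling interval, and the way the ginkgo PID is obtained are all assumptions, not part of the actual job setup): poll build-log.txt, and if it stops growing for an hour, send SIGABRT so the Go process crashes with stack traces (how many goroutines get printed depends on GOTRACEBACK).

```go
// Hypothetical watchdog: poll the size of build-log.txt and, if it hasn't
// grown for an hour, send SIGABRT to the ginkgo process to force a crash
// with stack traces. The path and PID below are placeholders.
package main

import (
	"log"
	"os"
	"syscall"
	"time"
)

func main() {
	const logPath = "build-log.txt" // assumed location of the Jenkins build log
	ginkgoPID := 12345              // placeholder: look this up via pgrep or the job wrapper

	var lastSize int64
	lastGrowth := time.Now()

	for range time.Tick(time.Minute) {
		info, err := os.Stat(logPath)
		if err != nil {
			log.Printf("stat %s: %v", logPath, err)
			continue
		}
		if info.Size() != lastSize {
			lastSize = info.Size()
			lastGrowth = time.Now()
			continue
		}
		if time.Since(lastGrowth) > time.Hour {
			log.Printf("no log growth for an hour, sending SIGABRT to %d", ginkgoPID)
			if err := syscall.Kill(ginkgoPID, syscall.SIGABRT); err != nil {
				log.Printf("kill: %v", err)
			}
			return
		}
	}
}
```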
Also, we're hopefully gonna set up an environment to run these e2e tests periodically ourselves in the next few days to start hunting this flake. Are there instructions anywhere on how to get these running in the same manner as your builds?
That prefixed output does not look too bad. I sent a PR to make a subset of our test jobs use it, and another where I took a stab at replacing SIGINT with SIGABRT.
Hey all, please take a look at #461 and comment on whether or not it's a viable/valuable next step towards making progress here.
I'm working through the backlog of old Ginkgo issues - apologies as this issue is probably stale now. |
We have a test suite running ginkgo tests in parallel on Jenkins. I'm currently trying to track down a hanging test in kubernetes/kubernetes#13485, but it's been very hard to figure out what's wrong, since the log output doesn't seem to be including the hanging test.
The log for one of our failing runs shows nothing for over 30m:
I'm guessing this is intentional, since Ginkgo only prints out the log when a test completes. However, it doesn't seem to be handling the interrupt here properly - I'd expect it to dump any in-progress tests so that you could see what is stuck.
(I know about the Ginkgo parallel streaming mode, and I've been trying to use it, but this particular test failure seems to be very difficult to reproduce on demand.)
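One workaround sometimes used in this kind of situation (an assumption on my part, not a Ginkgo feature; the package name is hypothetical): have the suite itself dump all goroutine stacks when it receives the interrupt, so the hanging spec is visible even if the parallel runner's aggregated output never arrives. Note that signal.Notify adds a channel alongside any handler the test runner has registered rather than replacing it.

```go
// Hypothetical workaround inside the test suite: on SIGINT, print every
// goroutine's stack to stderr so the hung spec can be identified even when
// the parallel runner emits no per-test output. Not part of Ginkgo itself.
package e2e_test

import (
	"fmt"
	"os"
	"os/signal"
	"runtime"
	"syscall"

	. "github.com/onsi/ginkgo"
)

var _ = BeforeSuite(func() {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGINT)
	go func() {
		<-sigs
		buf := make([]byte, 1<<20)
		n := runtime.Stack(buf, true) // true = include all goroutines
		fmt.Fprintf(os.Stderr, "interrupted, goroutine dump:\n%s\n", buf[:n])
	}()
})
```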