Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

in parallel mode, no output for hanging test on timeout-induced interrupt #206

Closed
ixdy opened this issue Jan 6, 2016 · 20 comments
Closed

Comments

@ixdy
Copy link
Contributor

ixdy commented Jan 6, 2016

We have a test suite running ginkgo tests in parallel on Jenkins. I'm currently trying to track down a hanging test in kubernetes/kubernetes#13485, but it's been very hard to figure out what's wrong, since the log output doesn't seem to be including the hanging test.

The log for one of our failing runs shows nothing for over 30m:

01:30:04 STEP: Destroying namespace "e2e-tests-pod-disks-z910m" for this suite.
01:30:04 
01:30:04 
01:30:04 • [SLOW TEST:376.248 seconds]
01:30:04 Pod Disks
01:30:04 /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/pd.go:267
01:30:04   should schedule a pod w/two RW PDs both mounted to one container, write to PD, verify contents, delete pod, recreate pod, verify contents, and repeat in rapid succession
01:30:04   /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/pd.go:266
01:30:04 ------------------------------
02:03:34 Build timed out (after 60 minutes). Marking the build as aborted.
02:03:34 Build timed out (after 60 minutes). Marking the build as failed.
02:03:34 Build was aborted
02:03:34 Recording test results
02:03:34 
02:03:34 Ran 119 of 212 Specs in 2486.623 seconds
02:03:34 FAIL! -- 116 Passed | 0 Failed | 2 Pending | 91 Skipped 
02:03:34 
02:03:34 Ginkgo ran 1 suite in 41m27.528523653s
02:03:34 Test Suite Failed

I'm guessing this is intentional, since Ginkgo only prints out the log when a test completes. However, it doesn't seem to be handling the interrupt here properly - I'd expect it to dump any in-progress tests so that you could see what is stuck.

(I know about the Ginkgo parallel streaming mode, and I've been trying to use it, but this particular test failure seems to be very difficult to reproduce on demand.)

@onsi
Copy link
Owner

onsi commented Feb 1, 2016

Sorry for the delay - What signal is being sent here? It should handle SIGINT correctly and do its best to emit logs. To confirm: are these being run in parallel?

@ixdy
Copy link
Contributor Author

ixdy commented Feb 9, 2016

Yes, the tests are being run in parallel.

A recent example from this morning is http://pr-test.k8s.io/20810/kubernetes-pull-build-test-e2e-gce/27402/build-log.txt. (All times in this log are UTC, I think.) We have the timeout set to 1hr, so at around 16:31, Jenkins sent an interrupt to abort, though there's oddly nothing in the console output from Ginkgo after 15:53 or so.

Ginkgo printed a bit more, shortly after Jenkins uploaded its build log (asynchronous processes are great):

     -------------------------------------------------------------------
    |                                                                   |
    |  Ginkgo timed out waiting for all parallel nodes to report back!  |
    |                                                                   |
     -------------------------------------------------------------------

[1] Feb  8 15:43:55.346: INFO: Fetching cloud provider for "gce"
[1] 
[1] Feb  8 15:43:55.346: INFO: Using service account "[email protected]" as token source.
[1] I0208 15:43:55.346548   11512 gce.go:244] Using existing Token Source &oauth2.reuseTokenSource{new:google.computeSource{account:"[email protected]"}, mu:sync.Mutex{state:0, sema:0x0}, t:(*oauth2.Token)(nil)}
[1] I0208 15:43:55.654337   11512 e2e.go:238] Starting e2e run "c2afc91e-ce7a-11e5-8b5f-42010af00003" on Ginkgo node 1
[2] Feb  8 15:43:55.350: INFO: Fetching cloud provider for "gce"
[2] 
[2] Feb  8 15:43:55.350: INFO: Using service account "[email protected]" as token source.
[2] I0208 15:43:55.350881   11514 gce.go:244] Using existing Token Source &oauth2.reuseTokenSource{new:google.computeSource{account:"[email protected]"}, mu:sync.Mutex{state:0, sema:0x0}, t:(*oauth2.Token)(nil)}
[2] I0208 15:43:55.627287   11514 e2e.go:238] Starting e2e run "c2b4d45e-ce7a-11e5-9017-42010af00003" on Ginkgo node 2
[3] Feb  8 15:43:55.297: INFO: Fetching cloud provider for "gce"
[3] 
[3] Feb  8 15:43:55.299: INFO: Using service account "[email protected]" as token source.
[3] I0208 15:43:55.299358   11515 gce.go:244] Using existing Token Source &oauth2.reuseTokenSource{new:google.computeSource{account:"[email protected]"}, mu:sync.Mutex{state:0, sema:0x0}, t:(*oauth2.Token)(nil)}
[3] I0208 15:43:55.580292   11515 e2e.go:238] Starting e2e run "c2ae2c73-ce7a-11e5-a23a-42010af00003" on Ginkgo node 3
[4] Feb  8 15:43:55.331: INFO: Fetching cloud provider for "gce"
[4] 
[4] Feb  8 15:43:55.331: INFO: Using service account "[email protected]" as token source.
[4] I0208 15:43:55.331546   11516 gce.go:244] Using existing Token Source &oauth2.reuseTokenSource{new:google.computeSource{account:"[email protected]"}, mu:sync.Mutex{state:0, sema:0x0}, t:(*oauth2.Token)(nil)}
[4] I0208 15:43:55.625897   11516 e2e.go:238] Starting e2e run "c2b36644-ce7a-11e5-966d-42010af00003" on Ginkgo node 4
[5] Feb  8 15:43:55.326: INFO: Fetching cloud provider for "gce"
[5] 
[5] Feb  8 15:43:55.326: INFO: Using service account "[email protected]" as token source.
[5] I0208 15:43:55.326104   11517 gce.go:244] Using existing Token Source &oauth2.reuseTokenSource{new:google.computeSource{account:"[email protected]"}, mu:sync.Mutex{state:0, sema:0x0}, t:(*oauth2.Token)(nil)}
[5] I0208 15:43:55.627418   11517 e2e.go:238] Starting e2e run "c2b3fb5f-ce7a-11e5-8f17-42010af00003" on Ginkgo node 5
[6] Feb  8 15:43:55.312: INFO: Fetching cloud provider for "gce"
[6] 
[6] Feb  8 15:43:55.312: INFO: Using service account "[email protected]" as token source.
[6] I0208 15:43:55.312385   11523 gce.go:244] Using existing Token Source &oauth2.reuseTokenSource{new:google.computeSource{account:"[email protected]"}, mu:sync.Mutex{state:0, sema:0x0}, t:(*oauth2.Token)(nil)}
[6] I0208 15:43:55.593888   11523 e2e.go:238] Starting e2e run "c2b12809-ce7a-11e5-ac47-42010af00003" on Ginkgo node 6
[7] Feb  8 15:43:55.336: INFO: Fetching cloud provider for "gce"
[7] 
[7] Feb  8 15:43:55.336: INFO: Using service account "[email protected]" as token source.
[7] I0208 15:43:55.336177   11524 gce.go:244] Using existing Token Source &oauth2.reuseTokenSource{new:google.computeSource{account:"[email protected]"}, mu:sync.Mutex{state:0, sema:0x0}, t:(*oauth2.Token)(nil)}
[7] I0208 15:43:55.625327   11524 e2e.go:238] Starting e2e run "c2b41c04-ce7a-11e5-af6e-42010af00003" on Ginkgo node 7
[8] Feb  8 15:43:55.345: INFO: Fetching cloud provider for "gce"
[8] 
[8] Feb  8 15:43:55.345: INFO: Using service account "[email protected]" as token source.
[8] I0208 15:43:55.345206   11525 gce.go:244] Using existing Token Source &oauth2.reuseTokenSource{new:google.computeSource{account:"[email protected]"}, mu:sync.Mutex{state:0, sema:0x0}, t:(*oauth2.Token)(nil)}
[8] I0208 15:43:55.662656   11525 e2e.go:238] Starting e2e run "c2b280f0-ce7a-11e5-8837-42010af00003" on Ginkgo node 8
[9] Feb  8 15:43:55.353: INFO: Fetching cloud provider for "gce"
[9] 
[9] Feb  8 15:43:55.353: INFO: Using service account "[email protected]" as token source.
[9] I0208 15:43:55.353783   11527 gce.go:244] Using existing Token Source &oauth2.reuseTokenSource{new:google.computeSource{account:"[email protected]"}, mu:sync.Mutex{state:0, sema:0x0}, t:(*oauth2.Token)(nil)}
[9] I0208 15:43:55.625033   11527 e2e.go:238] Starting e2e run "c2b6704d-ce7a-11e5-ae36-42010af00003" on Ginkgo node 9
[10] Feb  8 15:43:55.322: INFO: Fetching cloud provider for "gce"
[10] 
[10] Feb  8 15:43:55.322: INFO: Using service account "[email protected]" as token source.
[10] I0208 15:43:55.322373   11534 gce.go:244] Using existing Token Source &oauth2.reuseTokenSource{new:google.computeSource{account:"[email protected]"}, mu:sync.Mutex{state:0, sema:0x0}, t:(*oauth2.Token)(nil)}
[10] I0208 15:43:55.664503   11534 e2e.go:238] Starting e2e run "c2b50e42-ce7a-11e5-bb3f-42010af00003" on Ginkgo node 10
[11] Feb  8 15:43:55.335: INFO: Fetching cloud provider for "gce"
[11] 
[11] Feb  8 15:43:55.335: INFO: Using service account "[email protected]" as token source.
[11] I0208 15:43:55.335243   11540 gce.go:244] Using existing Token Source &oauth2.reuseTokenSource{new:google.computeSource{account:"[email protected]"}, mu:sync.Mutex{state:0, sema:0x0}, t:(*oauth2.Token)(nil)}
[11] I0208 15:43:55.627937   11540 e2e.go:238] Starting e2e run "c2b36aca-ce7a-11e5-8f45-42010af00003" on Ginkgo node 11
[12] Feb  8 15:43:55.305: INFO: Fetching cloud provider for "gce"
[12] 
[12] Feb  8 15:43:55.305: INFO: Using service account "[email protected]" as token source.
[12] I0208 15:43:55.305352   11543 gce.go:244] Using existing Token Source &oauth2.reuseTokenSource{new:google.computeSource{account:"[email protected]"}, mu:sync.Mutex{state:0, sema:0x0}, t:(*oauth2.Token)(nil)}
[12] I0208 15:43:55.626266   11543 e2e.go:238] Starting e2e run "c2b2adcb-ce7a-11e5-b15e-42010af00003" on Ginkgo node 12
[13] Feb  8 15:43:55.350: INFO: Fetching cloud provider for "gce"
[13] 
[13] Feb  8 15:43:55.350: INFO: Using service account "[email protected]" as token source.
[13] I0208 15:43:55.350309   11546 gce.go:244] Using existing Token Source &oauth2.reuseTokenSource{new:google.computeSource{account:"[email protected]"}, mu:sync.Mutex{state:0, sema:0x0}, t:(*oauth2.Token)(nil)}
[13] I0208 15:43:55.663211   11546 e2e.go:238] Starting e2e run "c2b64027-ce7a-11e5-905b-42010af00003" on Ginkgo node 13
[14] Feb  8 15:43:55.339: INFO: Fetching cloud provider for "gce"
[14] 
[14] Feb  8 15:43:55.339: INFO: Using service account "[email protected]" as token source.
[14] I0208 15:43:55.339461   11551 gce.go:244] Using existing Token Source &oauth2.reuseTokenSource{new:google.computeSource{account:"[email protected]"}, mu:sync.Mutex{state:0, sema:0x0}, t:(*oauth2.Token)(nil)}
[14] I0208 15:43:55.649845   11551 e2e.go:238] Starting e2e run "c2b3b0bf-ce7a-11e5-b676-42010af00003" on Ginkgo node 14
[15] Feb  8 15:43:55.339: INFO: Fetching cloud provider for "gce"
[15] 
[15] Feb  8 15:43:55.339: INFO: Using service account "[email protected]" as token source.
[15] I0208 15:43:55.339457   11554 gce.go:244] Using existing Token Source &oauth2.reuseTokenSource{new:google.computeSource{account:"[email protected]"}, mu:sync.Mutex{state:0, sema:0x0}, t:(*oauth2.Token)(nil)}
[15] I0208 15:43:55.648613   11554 e2e.go:238] Starting e2e run "c2b382f5-ce7a-11e5-8618-42010af00003" on Ginkgo node 15
[16] Feb  8 15:43:55.311: INFO: Fetching cloud provider for "gce"
[16] 
[16] Feb  8 15:43:55.311: INFO: Using service account "[email protected]" as token source.
[16] I0208 15:43:55.311612   11558 gce.go:244] Using existing Token Source &oauth2.reuseTokenSource{new:google.computeSource{account:"[email protected]"}, mu:sync.Mutex{state:0, sema:0x0}, t:(*oauth2.Token)(nil)}
[16] I0208 15:43:55.625630   11558 e2e.go:238] Starting e2e run "c2b44dc5-ce7a-11e5-b4f1-42010af00003" on Ginkgo node 16
[17] Feb  8 15:43:55.346: INFO: Fetching cloud provider for "gce"
[17] 
[17] Feb  8 15:43:55.346: INFO: Using service account "[email protected]" as token source.
[17] I0208 15:43:55.346338   11570 gce.go:244] Using existing Token Source &oauth2.reuseTokenSource{new:google.computeSource{account:"[email protected]"}, mu:sync.Mutex{state:0, sema:0x0}, t:(*oauth2.Token)(nil)}
[17] I0208 15:43:55.650020   11570 e2e.go:238] Starting e2e run "c2b519fe-ce7a-11e5-9d3d-42010af00003" on Ginkgo node 17
[18] Feb  8 15:43:55.347: INFO: Fetching cloud provider for "gce"
[18] 
[18] Feb  8 15:43:55.347: INFO: Using service account "[email protected]" as token source.
[18] I0208 15:43:55.347902   11576 gce.go:244] Using existing Token Source &oauth2.reuseTokenSource{new:google.computeSource{account:"[email protected]"}, mu:sync.Mutex{state:0, sema:0x0}, t:(*oauth2.Token)(nil)}
[18] I0208 15:43:55.648747   11576 e2e.go:238] Starting e2e run "c2b5a46e-ce7a-11e5-80c4-42010af00003" on Ginkgo node 18
[19] Feb  8 15:43:55.344: INFO: Fetching cloud provider for "gce"
[19] 
[19] Feb  8 15:43:55.344: INFO: Using service account "[email protected]" as token source.
[19] I0208 15:43:55.344614   11579 gce.go:244] Using existing Token Source &oauth2.reuseTokenSource{new:google.computeSource{account:"[email protected]"}, mu:sync.Mutex{state:0, sema:0x0}, t:(*oauth2.Token)(nil)}
[19] I0208 15:43:55.690735   11579 e2e.go:238] Starting e2e run "c2b54690-ce7a-11e5-ae2b-42010af00003" on Ginkgo node 19
[20] Feb  8 15:43:55.320: INFO: Fetching cloud provider for "gce"
[20] 
[20] Feb  8 15:43:55.320: INFO: Using service account "[email protected]" as token source.
[20] I0208 15:43:55.320297   11582 gce.go:244] Using existing Token Source &oauth2.reuseTokenSource{new:google.computeSource{account:"[email protected]"}, mu:sync.Mutex{state:0, sema:0x0}, t:(*oauth2.Token)(nil)}
[20] I0208 15:43:55.650485   11582 e2e.go:238] Starting e2e run "c2b471f4-ce7a-11e5-a625-42010af00003" on Ginkgo node 20
[21] Feb  8 15:43:55.315: INFO: Fetching cloud provider for "gce"
[21] 
[21] Feb  8 15:43:55.315: INFO: Using service account "[email protected]" as token source.
[21] I0208 15:43:55.316001   11591 gce.go:244] Using existing Token Source &oauth2.reuseTokenSource{new:google.computeSource{account:"[email protected]"}, mu:sync.Mutex{state:0, sema:0x0}, t:(*oauth2.Token)(nil)}
[21] I0208 15:43:55.629659   11591 e2e.go:238] Starting e2e run "c2b512e9-ce7a-11e5-a8c7-42010af00003" on Ginkgo node 21
[22] Feb  8 15:43:55.338: INFO: Fetching cloud provider for "gce"
[22] 
[22] Feb  8 15:43:55.338: INFO: Using service account "[email protected]" as token source.
[22] I0208 15:43:55.338983   11597 gce.go:244] Using existing Token Source &oauth2.reuseTokenSource{new:google.computeSource{account:"[email protected]"}, mu:sync.Mutex{state:0, sema:0x0}, t:(*oauth2.Token)(nil)}
[22] I0208 15:43:55.629053   11597 e2e.go:238] Starting e2e run "c2b4d895-ce7a-11e5-8e4b-42010af00003" on Ginkgo node 22
[23] Feb  8 15:43:55.317: INFO: Fetching cloud provider for "gce"
[23] 
[23] Feb  8 15:43:55.317: INFO: Using service account "[email protected]" as token source.
[23] I0208 15:43:55.317895   11598 gce.go:244] Using existing Token Source &oauth2.reuseTokenSource{new:google.computeSource{account:"[email protected]"}, mu:sync.Mutex{state:0, sema:0x0}, t:(*oauth2.Token)(nil)}
[23] I0208 15:43:55.627417   11598 e2e.go:238] Starting e2e run "c2b559c4-ce7a-11e5-8156-42010af00003" on Ginkgo node 23
[24] Feb  8 15:43:55.344: INFO: Fetching cloud provider for "gce"
[24] 
[24] Feb  8 15:43:55.344: INFO: Using service account "[email protected]" as token source.
[24] I0208 15:43:55.344263   11604 gce.go:244] Using existing Token Source &oauth2.reuseTokenSource{new:google.computeSource{account:"[email protected]"}, mu:sync.Mutex{state:0, sema:0x0}, t:(*oauth2.Token)(nil)}
[24] I0208 15:43:55.626569   11604 e2e.go:238] Starting e2e run "c2b3b14d-ce7a-11e5-ba94-42010af00003" on Ginkgo node 24
[25] Feb  8 15:43:55.334: INFO: Fetching cloud provider for "gce"
[25] 
[25] Feb  8 15:43:55.334: INFO: Using service account "[email protected]" as token source.
[25] I0208 15:43:55.335002   11606 gce.go:244] Using existing Token Source &oauth2.reuseTokenSource{new:google.computeSource{account:"[email protected]"}, mu:sync.Mutex{state:0, sema:0x0}, t:(*oauth2.Token)(nil)}
[25] I0208 15:43:55.630943   11606 e2e.go:238] Starting e2e run "c2b6e38e-ce7a-11e5-b3b1-42010af00003" on Ginkgo node 25
[26] Feb  8 15:43:55.343: INFO: Fetching cloud provider for "gce"
[26] 
[26] Feb  8 15:43:55.343: INFO: Using service account "[email protected]" as token source.
[26] I0208 15:43:55.343876   11612 gce.go:244] Using existing Token Source &oauth2.reuseTokenSource{new:google.computeSource{account:"[email protected]"}, mu:sync.Mutex{state:0, sema:0x0}, t:(*oauth2.Token)(nil)}
[26] I0208 15:43:55.653070   11612 e2e.go:238] Starting e2e run "c2b44ed9-ce7a-11e5-b048-42010af00003" on Ginkgo node 26
[27] Feb  8 15:43:55.359: INFO: Fetching cloud provider for "gce"
[27] 
[27] Feb  8 15:43:55.359: INFO: Using service account "[email protected]" as token source.
[27] I0208 15:43:55.359629   11615 gce.go:244] Using existing Token Source &oauth2.reuseTokenSource{new:google.computeSource{account:"[email protected]"}, mu:sync.Mutex{state:0, sema:0x0}, t:(*oauth2.Token)(nil)}
[27] I0208 15:43:55.648333   11615 e2e.go:238] Starting e2e run "c2b7f2ec-ce7a-11e5-ae7a-42010af00003" on Ginkgo node 27
[28] Feb  8 15:43:55.343: INFO: Fetching cloud provider for "gce"
[28] 
[28] Feb  8 15:43:55.343: INFO: Using service account "[email protected]" as token source.
[28] I0208 15:43:55.343535   11626 gce.go:244] Using existing Token Source &oauth2.reuseTokenSource{new:google.computeSource{account:"[email protected]"}, mu:sync.Mutex{state:0, sema:0x0}, t:(*oauth2.Token)(nil)}
[28] I0208 15:43:55.624945   11626 e2e.go:238] Starting e2e run "c2b4bde2-ce7a-11e5-a916-42010af00003" on Ginkgo node 28
[29] Feb  8 15:43:55.317: INFO: Fetching cloud provider for "gce"
[29] 
[29] Feb  8 15:43:55.317: INFO: Using service account "[email protected]" as token source.
[29] I0208 15:43:55.317192   11671 gce.go:244] Using existing Token Source &oauth2.reuseTokenSource{new:google.computeSource{account:"[email protected]"}, mu:sync.Mutex{state:0, sema:0x0}, t:(*oauth2.Token)(nil)}
[29] I0208 15:43:55.626450   11671 e2e.go:238] Starting e2e run "c2b55a03-ce7a-11e5-b9a1-42010af00003" on Ginkgo node 29
[30] Feb  8 15:43:55.352: INFO: Fetching cloud provider for "gce"
[30] 
[30] Feb  8 15:43:55.352: INFO: Using service account "[email protected]" as token source.
[30] I0208 15:43:55.352810   11629 gce.go:244] Using existing Token Source &oauth2.reuseTokenSource{new:google.computeSource{account:"[email protected]"}, mu:sync.Mutex{state:0, sema:0x0}, t:(*oauth2.Token)(nil)}
[30] I0208 15:43:55.625739   11629 e2e.go:238] Starting e2e run "c2b5db8e-ce7a-11e5-a544-42010af00003" on Ginkgo node 30

Ginkgo ran 1 suite in 47m14.702681305s
Test Suite Failed

This appears to just be the output from the SynchronizedBeforeSuite function - nothing from any of the running tests that I can find.

@ixdy
Copy link
Contributor Author

ixdy commented Feb 10, 2016

Another mysterious timeout, of the form in my first comment:

21:36:27 • [SLOW TEST:338.895 seconds]
21:36:27 Services
21:36:27 /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/service.go:885
21:36:27   should be able to change the type and ports of a service
21:36:27   /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/service.go:700
21:36:27 ------------------------------
21:54:43 
Build timed out (after 30 minutes). Marking the build as failed.
21:54:43 Build was aborted

(Full log here)

The lack of context is making it hard to tell where we were stuck.

@robdimsdale
Copy link
Contributor

I'm trying to understand what might be different between running ginkgo locally, where we observe ginkgo responds correctly to SIGINT, and in your CI environment, where you observe that it hangs.

I haven't used Jenkins in a while, so I'd just like to confirm that the signal that Jenkins sending to ginkgo is in fact SIGINT, and not e.g. SIGTERM or SIGQUIT?

@ixdy
Copy link
Contributor Author

ixdy commented Feb 17, 2016

I think it sends SIGTERM, but with dbb5c6c that should be handled correctly.

One guess I have is that Jenkins is sending SIGTERM to all of the Ginkgo nodes, not just node 1, though I haven't fully investigated this theory yet. It seems like it might explain why the Ginkgo nodes aren't all returning.

@ixdy
Copy link
Contributor Author

ixdy commented Feb 17, 2016

Haven't looked at the code, but from https://wiki.jenkins-ci.org/display/JENKINS/Aborting+a+build:

The executor performs a clean up operation. This depends on what it was doing by the time it noticed the interruption.
If it was waiting for a completion of a child process, Jenkins will search for all the descendant processes and kill them all. On Unix, this is done through java.lang.UnixProcess.destroyProcess, which sends SIGTERM on Sun's JREs.

@robdimsdale
Copy link
Contributor

And just to confirm, you are running the latest version of the ginkgo binary?

It would be interesting if you're able to follow up on that line of investigation. Ginkgo does seem to spawn multiple os-level processes (at least on OSX; I assume the behavior is the same on linux) and it seems feasible that Jenkins would signal all the processes in the tree.

@robdimsdale
Copy link
Contributor

I replied before I saw your comment above. It should be fairly easy to reproduce the described Jenkins behavior locally by creating a bunch of tests that run long enough to find and kill all the PIDs, and observe whether ginkgo hangs.

@ixdy
Copy link
Contributor Author

ixdy commented Feb 17, 2016

We're currently using Ginkgo v1.2.0-42-g07d85e6, which appears to be the latest.

I can try simulating a "kill-all-Ginkgo-processes" setup to see if I can get it to reproduce locally.

Looking around at Jenkins stuff, I found this long-open issue with Jenkins regarding graceful process termination (which is somewhat outdated - Jenkins doesn't use SIGKILL anymore), and also stumbled upon https://wiki.jenkins-ci.org/display/JENKINS/ProcessTreeKiller, which notes setting BUILD_ID=dontKillMe in the environment to prevent its process killer from killing processes. This latter one is probably a bad idea (seems like it'd be easy to get zombie Ginkgo processes that way), but I may also explore it as an option.

@williammartin
Copy link
Collaborator

@ixdy Did you end up exploring that option or finding a resolution?

@ixdy
Copy link
Contributor Author

ixdy commented Dec 18, 2017

I don't think we ever found a proper solution. I think it started happening less, possibly because we mitigated the root cause (bad tests?) in other ways.

@porridge
Copy link

@williammartin @robdimsdale @ixdy let me try and refresh this bug a little.
What we are currently seeing in kubernetes/kubernetes#56046 is:

  • we are no longer running this on jenkins, but are interrupting ginkgo from our own code
  • a suite of 700+ tests ran in parallel on 30 ginkgo nodes
  • progress stops at some point, after ~41 minutes
  • after ~9 more hours the process which started ginkgo killed its whole process group with syscall.Kill(-cmd.Process.Pid, syscall.SIGINT)
  • at this point some output from one of the tests was produced, followed by:
I0119 07:32:25.116] ---------------------------------------------------------
I0119 07:32:25.116] Received interrupt.  Running AfterSuite...
I0119 07:32:25.116] ^C again to terminate immediately
I0119 07:32:25.116] Jan 19 07:31:34.058: INFO: Running AfterSuite actions on all node

and further shutdown messages.

If you want to take a look, the logs are public: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-correctness/62?log#log (see the point where the timestamps switch from ~10pm to ~7am).

Note in particular that ginkgo hangs before we send the interrupt. It is not the tests that are slow or stuck - at the point of the hang there are over 300 tests to go, and with the parallelism of 30 nodes, there should be progress even if one - or a handful of - tests were stuck. So I don't really understand Rob's request in #206 (comment)

@williammartin
Copy link
Collaborator

Thanks for the update. It's a bit hard to parse the output but I see what you mean, that there should be at least some progress across 30 nodes in the 9 hour gap. I'll have a think but nothing springs to mind immediately around how to debug without some changes or manual intervention.

FWIW, in Garden we have a tool called slowmobius (https://github.com/cloudfoundry/garden-runc-release/blob/develop/ci/scripts/slowmobius) that sends us a slack message when a test suite is hanging so we can go and poke at things. Might be useful for you if this becomes a real pain point in the mean time.

@onsi
Copy link
Owner

onsi commented Jan 20, 2018

How frequently does this happen @porridge? How reproducible is this?

A long time ago (years) I invested effort to try to reproduce this but could not. I’ve seen it happen in the wild though and would love a reproducible case.

My gut says the issue is somehow related to how the parallel nodes capture output - that some sort of non-obvious lower level bug (possibly outside of ginkgo... go? kernel?) gets triggered.

I’d love to validate/invalidate this hypothesis. If you are seeing this with some frequency and my guess is correct then running ginkgo with the -stream flag should resolve the issue at the cost of giving you interleaved test output.

I’m not saying -stream is the long-term fix - but I’d like to get some real data around whether or not it solves the problem. If it does then we can rewrite how log capture works and move it from the clients (the running nodes) to the server.

@porridge
Copy link

porridge commented Jan 29, 2018

@onsi happened 2 times so far with 75 attempts, so ~2.5% rate. We haven't tried reproducing it, since a successful run takes ~9 hours.

Does -stream leave some way of matching up output from individual cases? For example prefix output with node ID? Without it the logs would be nearly useless, and with such low error rate, it would take us months to say with confidence that it has helped.

BTW, when counting these hangs, I noticed today that in both cases the message about interrupt and starting AfterSuite is repeated after a millisecond or four:

porridge@kielonek:~/projects/k8s/56046$ grep 'Received interrupt' */*
scale/31:I1117 07:31:07.814] Received interrupt.  Running AfterSuite...
scale/31:I1117 07:31:07.815] Received interrupt.  Running AfterSuite...
scale/62:I0119 07:32:25.116] Received interrupt.  Running AfterSuite...
scale/62:I0119 07:32:25.120] Received interrupt.  Running AfterSuite...

Is this expected?

@porridge
Copy link

@williammartin thanks for the suggestion. What would we need to look at if we caught it red-handed though? Just SIGABRT and take stack traces?

@ixdy that script would not be terribly useful because it's timeout-based, but if we could periodically measure the size of build-log.txt, then an hour of no change in size would be a good indication that this is happening, and we could abort the run forcing stack trace dumps. Do you think this would be feasible?
Actually now that I think about it, perhaps we could SIGABRT unconditionally on timeout, instead of SIGINT?

@williammartin
Copy link
Collaborator

williammartin commented Feb 12, 2018

@porridge

-stream prints node numbers along with output e.g:

○ → ginkgo -nodes 4 -stream
[4] Running Suite: Measurebenchtest Suite
[4] =====================================
[4] Random Seed: 1518452635
[4] Parallel test node 4/4.
[4]
[1] Running Suite: Measurebenchtest Suite
[1] =====================================
[1] Random Seed: 1518452635
[1] Parallel test node 1/4.
[1]
[4] ••
[4] Ran 2 of 2 Specs in 0.001 seconds
[4] SUCCESS! -- 2 Passed | 0 Failed | 0 Pending | 0 Skipped PASS
[1] •
[1] Ran 1 of 1 Specs in 0.000 seconds
[1] SUCCESS! -- 1 Passed | 0 Failed | 0 Pending | 0 Skipped PASS
[3] Running Suite: Measurebenchtest Suite
[3] =====================================
[3] Random Seed: 1518452635
[3] Parallel test node 3/4.
[3]
[3]
[3] Ran 0 of 0 Specs in 0.000 seconds
[3] SUCCESS! -- 0 Passed | 0 Failed | 0 Pending | 0 Skipped PASS
[2] Running Suite: Measurebenchtest Suite
[2] =====================================
[2] Random Seed: 1518452635
[2] Parallel test node 2/4.
[2]
[2]
[2] Ran 0 of 0 Specs in 0.000 seconds
[2] SUCCESS! -- 0 Passed | 0 Failed | 0 Pending | 0 Skipped PASS

Ginkgo ran 1 suite in 454.006805ms
Test Suite Passed

I was thinking SIGABRT and take stack traces although I'm not 100% sure where the useful information would go. Looking at https://github.com/onsi/ginkgo/blob/master/internal/remote/output_interceptor_unix.go#L37-L38 when not using -stream we dup stdout and stderr to point at a file, so maybe we could find stack dumps in there. Alternatively just having a look at the system and understanding the process tree, the state of fds and finding out whether strace reveals anything interesting would also be neat.

Also, we're hopefully gonna set up an environment to run these e2e tests periodically ourselves in the next few days to start hunting this flake. Are there instructions anywhere on how to get these running in the same manner as your builds?

@porridge
Copy link

That prefixed output does not look too bad. Sent a PR to make a subset of our test jobs use it, and I also another where I took a stab at replacing SIGINT with SIGABRT.
Regarding replicating k8s test environment, I don't know how feasible that is. Perhaps @ixdy would know.

@onsi
Copy link
Owner

onsi commented Apr 25, 2018

Hey all please take a look at #461 and comment on whether or not its a viable/valuable next step towards making progress here.

@onsi
Copy link
Owner

onsi commented Apr 6, 2021

I'm working through the backlog of old Ginkgo issues - apologies as this issue is probably stale now.
I'm going to close this for now, but feel free to reopen it if you'd like.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants