in parallel mode, no output for hanging test on timeout-induced interrupt #206
Sorry for the delay - what signal is being sent here? It should handle SIGINT correctly and do its best to emit logs. To confirm: are these being run in parallel?
Yes, the tests are being run in parallel. A recent example from this morning is http://pr-test.k8s.io/20810/kubernetes-pull-build-test-e2e-gce/27402/build-log.txt. (All times in this log are UTC, I think.) We have the timeout set to 1hr, so at around 16:31, Jenkins sent an interrupt to abort, though there's oddly nothing in the console output from Ginkgo after 15:53 or so. Ginkgo printed a bit more, shortly after Jenkins uploaded its build log (asynchronous processes are great):
This appears to just be the output from the
Another mysterious timeout, of the form in my first comment:
(Full log here) The lack of context is making it hard to tell where we were stuck.
I'm trying to understand what might be different between running ginkgo locally, where we observe ginkgo responds correctly to SIGINT, and running it under Jenkins. I haven't used Jenkins in a while, so I'd just like to confirm that the signal Jenkins is sending to ginkgo is in fact SIGINT.
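For what it's worth, one way to confirm which signal Jenkins actually delivers is to run a small probe in the same job and abort the build. This is a hypothetical sketch (not something from this thread; the job wiring is assumed):

```go
// Hypothetical probe: run this as the build step, abort the Jenkins build,
// and see which signal(s) arrive. Purely illustrative; not part of Ginkgo
// or the Kubernetes test setup.
package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
)

func main() {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGINT, syscall.SIGTERM, syscall.SIGHUP, syscall.SIGQUIT)
	fmt.Println("pid", os.Getpid(), "waiting to be aborted...")
	fmt.Println("received", <-sigs)
}
```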
I think it sends an interrupt. One guess I have is that Jenkins is sending the signal to every process in the tree, not just the top-level ginkgo process. |
Haven't looked at the code, but from https://wiki.jenkins-ci.org/display/JENKINS/Aborting+a+build:
And just to confirm, you are running the latest version of the ginkgo binary? It would be interesting if you're able to follow up on that line of investigation. Ginkgo does seem to spawn multiple os-level processes (at least on OSX; I assume the behavior is the same on linux) and it seems feasible that Jenkins would signal all the processes in the tree.
I replied before I saw your comment above. It should be fairly easy to reproduce the described Jenkins behavior locally: create a bunch of tests that run long enough for you to find and kill all the PIDs, and observe whether ginkgo hangs.
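As a rough illustration of that idea (assuming Ginkgo v1's dot-import style; the package and suite names here are made up), a suite like this sleeps long enough to hunt down the worker PIDs by hand and signal them all, mimicking Jenkins' process-tree termination:

```go
// Hypothetical reproduction suite: each parallel node just sleeps, leaving
// plenty of time to find the ginkgo worker PIDs and signal every process in
// the tree, then watch whether the runner hangs without output.
package hang_test

import (
	"testing"
	"time"

	. "github.com/onsi/ginkgo"
	. "github.com/onsi/gomega"
)

func TestHang(t *testing.T) {
	RegisterFailHandler(Fail)
	RunSpecs(t, "Hang Suite")
}

var _ = Describe("a deliberately slow spec", func() {
	It("sleeps so the process tree can be signalled by hand", func() {
		time.Sleep(10 * time.Minute)
		Expect(true).To(BeTrue())
	})
})
```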
We're currently using Ginkgo. I can try simulating a "kill-all-Ginkgo-processes" setup to see if I can get it to reproduce locally. Looking around at Jenkins stuff, I found a long-open Jenkins issue regarding graceful process termination (which is somewhat outdated - Jenkins doesn't use it anymore).
@ixdy Did you end up exploring that option or finding a resolution?
I don't think we ever found a proper solution. I think it started happening less, possibly because we mitigated the root cause (bad tests?) in other ways.
@williammartin @robdimsdale @ixdy let me try and refresh this bug a little.
and further shutdown messages. If you want to take a look, the logs are public: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-correctness/62?log#log (see the point where the timestamps switch from ~10pm to ~7am). Note in particular that ginkgo hangs before we send the interrupt. It is not the tests that are slow or stuck - at the point of the hang there are over 300 tests to go, and with a parallelism of 30 nodes there should be progress even if one - or a handful of - tests were stuck. So I don't really understand Rob's request in #206 (comment).
Thanks for the update. It's a bit hard to parse the output, but I see what you mean: there should be at least some progress across 30 nodes in the 9-hour gap. I'll have a think, but nothing springs to mind immediately around how to debug this without some changes or manual intervention. FWIW, in Garden we have a tool called
How frequently does this happen @porridge? How reproducible is this? A long time ago (years) I invested effort to try to reproduce this but could not. I've seen it happen in the wild though and would love a reproducible case. My gut says the issue is somehow related to how the parallel nodes capture output - that some sort of non-obvious lower-level bug (possibly outside of ginkgo... go? kernel?) gets triggered. I'd love to validate/invalidate this hypothesis. If you are seeing this with some frequency and my guess is correct, then running ginkgo with the -stream flag should resolve the issue at the cost of giving you interleaved test output. I'm not saying -stream is the long-term fix - but I'd like to get some real data around whether or not it solves the problem. If it does, then we can rewrite how log capture works and move it from the clients (the running nodes) to the server.
@onsi it has happened 2 times so far in 75 attempts, so a ~2.5% rate. We haven't tried reproducing it, since a successful run takes ~9 hours. BTW, when counting these hangs, I noticed today that in both cases the message about the interrupt and starting AfterSuite is repeated after a millisecond or four:
Is this expected?
@williammartin thanks for the suggestion. What would we need to look at if we caught it red-handed, though? Just send SIGABRT and take stack traces? @ixdy that script would not be terribly useful because it's timeout-based, but if we could periodically measure the size of build-log.txt, then an hour of no change in size would be a good indication that this is happening, and we could abort the run, forcing stack trace dumps. Do you think this would be feasible?
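A rough sketch of that watchdog idea (the log path, polling interval, and the way the ginkgo PID is obtained are all assumptions, not part of the actual job setup): poll build-log.txt, and if it stops growing for an hour, send SIGABRT so the Go process crashes with stack traces (how many goroutines get printed depends on GOTRACEBACK).

```go
// Hypothetical watchdog: poll the size of build-log.txt and, if it hasn't
// grown for an hour, send SIGABRT to the ginkgo process to force a crash
// with stack traces. The path and PID below are placeholders.
package main

import (
	"log"
	"os"
	"syscall"
	"time"
)

func main() {
	const logPath = "build-log.txt" // assumed location of the Jenkins build log
	ginkgoPID := 12345              // placeholder: look this up via pgrep or the job wrapper

	var lastSize int64
	lastGrowth := time.Now()

	for range time.Tick(time.Minute) {
		info, err := os.Stat(logPath)
		if err != nil {
			log.Printf("stat %s: %v", logPath, err)
			continue
		}
		if info.Size() != lastSize {
			lastSize = info.Size()
			lastGrowth = time.Now()
			continue
		}
		if time.Since(lastGrowth) > time.Hour {
			log.Printf("no log growth for an hour, sending SIGABRT to %d", ginkgoPID)
			if err := syscall.Kill(ginkgoPID, syscall.SIGABRT); err != nil {
				log.Printf("kill: %v", err)
			}
			return
		}
	}
}
```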
Also, we're hopefully gonna set up an environment to run these e2e tests periodically ourselves in the next few days to start hunting this flake. Are there instructions anywhere on how to get these running in the same manner as your builds?
That prefixed output does not look too bad. I sent a PR to make a subset of our test jobs use it, and another where I took a stab at replacing SIGINT with SIGABRT.
Hey all, please take a look at #461 and comment on whether or not it's a viable/valuable next step towards making progress here.
I'm working through the backlog of old Ginkgo issues - apologies as this issue is probably stale now. |
We have a test suite running ginkgo tests in parallel on Jenkins. I'm currently trying to track down a hanging test in kubernetes/kubernetes#13485, but it's been very hard to figure out what's wrong, since the log output doesn't seem to be including the hanging test.
The log for one of our failing runs shows nothing for over 30m:
I'm guessing this is intentional, since Ginkgo only prints out the log when a test completes. However, it doesn't seem to be handling the interrupt here properly - I'd expect it to dump any in-progress tests so that you could see what is stuck.
(I know about the Ginkgo parallel streaming mode, and I've been trying to use it, but this particular test failure seems to be very difficult to reproduce on demand.)
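One workaround sometimes used in this kind of situation (an assumption on my part, not a Ginkgo feature; the package name is hypothetical): have the suite itself dump all goroutine stacks when it receives the interrupt, so the hanging spec is visible even if the parallel runner's aggregated output never arrives. Note that signal.Notify adds a channel alongside any handler the test runner has registered rather than replacing it.

```go
// Hypothetical workaround inside the test suite: on SIGINT, print every
// goroutine's stack to stderr so the hung spec can be identified even when
// the parallel runner emits no per-test output. Not part of Ginkgo itself.
package e2e_test

import (
	"fmt"
	"os"
	"os/signal"
	"runtime"
	"syscall"

	. "github.com/onsi/ginkgo"
)

var _ = BeforeSuite(func() {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGINT)
	go func() {
		<-sigs
		buf := make([]byte, 1<<20)
		n := runtime.Stack(buf, true) // true = include all goroutines
		fmt.Fprintf(os.Stderr, "interrupted, goroutine dump:\n%s\n", buf[:n])
	}()
})
```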