workspace-in-sidecar example test is very flakey #4169
Here's a snippet of the pod yaml from the most recent failure I saw:
status:
conditions:
- lastProbeTime: null
lastTransitionTime: "2021-08-17T23:41:18Z"
status: "True"
type: Initialized
- lastProbeTime: null
lastTransitionTime: "2021-08-17T23:41:12Z"
message: 'containers with unready status: [step-unnamed-0 sidecar]'
reason: ContainersNotReady
status: "False"
type: Ready
- lastProbeTime: null
lastTransitionTime: "2021-08-17T23:41:12Z"
message: 'containers with unready status: [step-unnamed-0 sidecar]'
reason: ContainersNotReady
status: "False"
type: ContainersReady
- lastProbeTime: null
lastTransitionTime: "2021-08-17T23:41:12Z"
status: "True"
type: PodScheduled
containerStatuses:
- image: alpine:3.12.0
imageID: ""
lastState: {}
name: sidecar
ready: false
restartCount: 0
started: false
state:
waiting:
reason: PodInitializing
- image: alpine:3.12.0
imageID: ""
lastState: {}
name: step-unnamed-0
ready: false
restartCount: 0
started: false
state:
waiting:
reason: PodInitializing
hostIP: 10.128.0.3
initContainerStatuses:
- containerID: docker://62d43b4aa1f1792dbccabfa9833abbbae7dfd8c94018e0d812af84d09656b153
image: sha256:1da710f4bbbdfadf5c0f8ac1214f695105d05ea87c3f8c5795287c5bb5f50048
imageID: docker-pullable://gcr.io/tekton-prow-13/tpipeline-e2e-img/entrypoint-bff0a22da108bc2f16c818c97641a296@sha256:4125d5f93bbd74bc178ff5c29555b3cd4ad3f11ac39e38852400224d9a3f52df
lastState: {}
name: place-tools
ready: true
restartCount: 0
state:
terminated:
containerID: docker://62d43b4aa1f1792dbccabfa9833abbbae7dfd8c94018e0d812af84d09656b153
exitCode: 0
finishedAt: "2021-08-17T23:41:16Z"
reason: Completed
startedAt: "2021-08-17T23:41:16Z"
- containerID: docker://de1044deea32e4e2fc38601f9b1cc209a9cf9ad925127ae372e3720a2fdd106f
image: sha256:ac2afeee8f96767b2362a686e98503c9e1d1d24375be9896a0109b80e9cab631
imageID: docker-pullable://gcr.io/distroless/base@sha256:aa4fd987555ea10e1a4ec8765da8158b5ffdfef1e72da512c7ede509bc9966c4
lastState: {}
name: place-scripts
ready: true
restartCount: 0
state:
terminated:
containerID: docker://de1044deea32e4e2fc38601f9b1cc209a9cf9ad925127ae372e3720a2fdd106f
exitCode: 0
finishedAt: "2021-08-17T23:41:17Z"
reason: Completed
startedAt: "2021-08-17T23:41:17Z"
phase: Pending
And the taskrun YAML:
status:
completionTime: "2021-08-17T23:42:12Z"
conditions:
- lastTransitionTime: "2021-08-17T23:42:12Z"
message: TaskRun "workspace-in-sidecar-phrks" failed to finish within "1m0s"
reason: TaskRunTimeout
status: "False"
type: Succeeded
podName: workspace-in-sidecar-phrks-pod-x95zf
startTime: "2021-08-17T23:41:12Z"
steps:
- container: step-unnamed-0
name: unnamed-0
terminated:
exitCode: 1
finishedAt: "2021-08-17T23:42:12Z"
reason: TaskRunTimeout
startedAt: null
So it looks like |
A recent failure was observed in https://tekton-releases.appspot.com/build/tekton-prow/pr-logs/pull/tektoncd_pipeline/4322/pull-tekton-pipeline-alpha-integration-tests/1450527643072991232/
Snippet of the pod yaml from the failure:
status:
conditions:
- lastProbeTime: null
lastTransitionTime: "2021-10-19T18:40:35Z"
status: "True"
type: Initialized
- lastProbeTime: null
lastTransitionTime: "2021-10-19T18:41:26Z"
status: "True"
type: Ready
- lastProbeTime: null
lastTransitionTime: "2021-10-19T18:41:26Z"
status: "True"
type: ContainersReady
- lastProbeTime: null
lastTransitionTime: "2021-10-19T18:40:29Z"
status: "True"
type: PodScheduled
containerStatuses:
- containerID: docker://123efdc25e006661013b3539a8991f864c6ddb0c534c26a34b7991febb34c808
image: alpine:3.12.0
imageID: docker-pullable://alpine@sha256:185518070891758909c9f839cf4ca393ee977ac378609f700f60a771a2dfe321
lastState: {}
name: sidecar
ready: true
restartCount: 0
started: true
state:
running:
startedAt: "2021-10-19T18:41:26Z"
- containerID: docker://92936a2a5e44b9322c7ee2a1d4bb21effc930e3ba9a8cc938206c72fb61cb0e2
image: alpine:3.12.0
imageID: docker-pullable://alpine@sha256:185518070891758909c9f839cf4ca393ee977ac378609f700f60a771a2dfe321
lastState: {}
name: step-unnamed-0
ready: true
restartCount: 0
started: true
state:
running:
startedAt: "2021-10-19T18:41:26Z"
hostIP: 10.128.0.2
initContainerStatuses:
- containerID: docker://cb458054adb536e1bed3dc169e236e12b2010de15dc1109da3e39b3781581b53
image: sha256:ac3e8b91dabeb5702e2e47d26fc1102a382279faa9cbb26ad08db2e97c1c021e
imageID: docker-pullable://gcr.io/tekton-prow-1/tpipeline-e2e-img/entrypoint-bff0a22da108bc2f16c818c97641a296@sha256:e96742b774189b453add683ebc68df229b70794f82219300adee53413c01f69a
lastState: {}
name: place-tools
ready: true
restartCount: 0
state:
terminated:
containerID: docker://cb458054adb536e1bed3dc169e236e12b2010de15dc1109da3e39b3781581b53
exitCode: 0
finishedAt: "2021-10-19T18:40:33Z"
reason: Completed
startedAt: "2021-10-19T18:40:33Z"
- containerID: docker://5d4c347e735da93b1514a3e003e365ef6b558af7b4504eb54a0711a9b7826101
image: sha256:ac2afeee8f96767b2362a686e98503c9e1d1d24375be9896a0109b80e9cab631
imageID: docker-pullable://gcr.io/distroless/base@sha256:aa4fd987555ea10e1a4ec8765da8158b5ffdfef1e72da512c7ede509bc9966c4
lastState: {}
name: place-scripts
ready: true
restartCount: 0
state:
terminated:
containerID: docker://5d4c347e735da93b1514a3e003e365ef6b558af7b4504eb54a0711a9b7826101
exitCode: 0
finishedAt: "2021-10-19T18:40:35Z"
reason: Completed
startedAt: "2021-10-19T18:40:35Z"
phase: Running
Snippet of the taskrun yaml from the failure:
status:
completionTime: "2021-10-19T18:41:29Z"
conditions:
- lastTransitionTime: "2021-10-19T18:41:29Z"
message: TaskRun "workspace-in-sidecar-rtmjf" failed to finish within "1m0s"
reason: TaskRunTimeout
status: "False"
type: Succeeded
podName: workspace-in-sidecar-rtmjf-pod-hj2bc
startTime: "2021-10-19T18:40:29Z"
steps:
- container: step-unnamed-0
imageID: docker-pullable://alpine@sha256:185518070891758909c9f839cf4ca393ee977ac378609f700f60a771a2dfe321
name: unnamed-0
terminated:
exitCode: 1
finishedAt: "2021-10-19T18:41:29Z"
reason: TaskRunTimeout
startedAt: "2021-10-19T18:41:26Z" |
Under
Edit: - looks like that specific warning comes from the kubelet when a volume definition hasn't been supplied for a volume mount name. The pod spec for |
Especially weird that it wouldn't find the volume definition when we're using emptyDir! I thought maybe a PVC hadn't been created, but how is it possible for the emptyDir volume to not be found??
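For context, here's a rough sketch of how an emptyDir-backed workspace should end up wired into the generated pod: one volume definition plus matching volumeMounts in both the step and sidecar containers. The names, mount path, and commands below are illustrative placeholders rather than what Tekton actually generates, but the kubelet warning discussed above is what appears when a volumeMount's name has no corresponding volumes entry.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: workspace-in-sidecar-sketch   # hypothetical name, not the generated pod name
spec:
  volumes:
    # the emptyDir volume the workspace binding should produce
    - name: ws-shared
      emptyDir: {}
  containers:
    - name: step-unnamed-0
      image: alpine:3.12.0
      command: ["sleep", "60"]
      volumeMounts:
        - name: ws-shared          # must match a volumes[].name entry; if it doesn't,
          mountPath: /workspace    # the kubelet reports the volume as not found
    - name: sidecar
      image: alpine:3.12.0
      command: ["sleep", "60"]
      volumeMounts:
        - name: ws-shared
          mountPath: /workspace
```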
The workspace-in-sidecar example has been a relatively consistent flake in our CI suite. The issue #4169 was created in August describing this problem and there have been relatively regular reports related to the example since then. To a first approximation I've found 31 individual instances of the test's failure since May across 24 separate Pull Requests. Attempts to reduce the flakiness of this example have not been successful yet. So in order to reduce the amount of noise in our CI (and hopefully increase overall trust in the suite) this commit moves the example into the no-ci folder, preventing its execution, while we continue to try and figure out why it is so flakey.
Once PR 4634 is merged, this flake should be resolved.
/close
@vdemeester: Closing this issue.
/reopen
It looks like this flaked again on #4596. The test run logs are here: https://tekton-releases.appspot.com/build/tekton-prow/pr-logs/pull/tektoncd_pipeline/4596/pull-tekton-pipeline-alpha-integration-tests/1499073528324952067/
@sbwsg: Reopened this issue.
From the logs in that test run we see the following:
So what we're seeing here is that the Step never created the fifo special file for communicating between the containers. Either that or the sidecar couldn't see the fifo special file for some reason. Looking at the container statuses it looks like both were ready and started at the same time:
The TaskRun's start time is recorded as |
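For anyone not familiar with the example, the handshake being discussed works roughly like the sketch below: the step creates a fifo in the shared emptyDir workspace and blocks reading from it, while the sidecar waits for the fifo to appear and then writes into it. This is a simplified stand-in, not the shipped example — the names, messages, and which side writes versus reads are assumptions here, and I'm assuming the workspace path variable resolves in sidecar scripts the same way it does in steps — but it shows why a step that never reaches mkfifo, or a sidecar that never sees the fifo, hangs until the TaskRun timeout.

```yaml
apiVersion: tekton.dev/v1beta1
kind: TaskRun
metadata:
  generateName: workspace-in-sidecar-sketch-   # hypothetical, not the real example
spec:
  workspaces:
    - name: signals
      emptyDir: {}
  taskSpec:
    workspaces:
      - name: signals
    steps:
      - image: alpine:3.12.0
        script: |
          #!/usr/bin/env sh
          # Create the fifo in the shared workspace, then block until the
          # sidecar writes a line into it.
          mkfifo $(workspaces.signals.path)/pipe
          read line < $(workspaces.signals.path)/pipe
          echo "step received: $line"
    sidecars:
      - image: alpine:3.12.0
        workspaces:
          - name: signals
        script: |
          #!/usr/bin/env sh
          # Wait for the step to create the fifo, then write to it so the
          # step's read can complete.
          while [ ! -p $(workspaces.signals.path)/pipe ]; do sleep 1; done
          echo "hello from the sidecar" > $(workspaces.signals.path)/pipe
```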
Yeah, it's definitely strange. (abundant hand waving to come 👋) 😄 To check (A), we could add another debug echo before the mkfifo call, to see whether the script itself is even getting called. To check (B), the only thing that I can think of is that, if it is some kind of race condition causing a lock in our use of |
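Concretely, checking (A) could be as small as the tweak sketched below to the step script from the hypothetical task above — an echo before and after the mkfifo call so the step logs show whether the script runs at all and what mkfifo returned (paths and messages are still placeholders):

```yaml
steps:
  - image: alpine:3.12.0
    script: |
      #!/usr/bin/env sh
      echo "step script started"                  # (A): proves the script was invoked at all
      mkfifo $(workspaces.signals.path)/pipe
      echo "mkfifo exited with status $?"         # did mkfifo complete, and how?
      read line < $(workspaces.signals.path)/pipe
```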
I've tried to reproduce this, and failed every time. The failures I've seen in CI have all been |
I'm now fairly convinced that sidecar tests are just particularly vulnerable to limited resources on the cluster, and are therefore more likely to just take too long or otherwise flake out. Hopefully #4871 will improve this a bit by decreasing the number of e2e tests run in parallel from 16 to 4.
Issues go stale after 90d of inactivity.
/lifecycle stale
Send feedback to tektoncd/plumbing.
/lifecycle frozen
I am going to close this as it doesn't seem to be as flaky anymore, and I think a lot of PRs went in to fix this issue one way or the other.
Expected Behavior
Examples should run reasonably consistently from PR to PR.
Actual Behavior
I estimate that the workspace-in-sidecar example fails about half the time I initiate an integration run in our CI. It just doesn't seem to be able to complete successfully on a regular basis. It's super unclear why this is, but here's the error log from the most recent run that failed for me:
Hunch: maybe setting explicit resource requirements for the containers in the example would help?
^ This unfortunately did not help after all.
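For the record, the resource-requirements hunch would have looked roughly like the fragment below, added to the example's step and sidecar (the values are made-up placeholders, and as noted above this didn't end up fixing the flake):

```yaml
steps:
  - image: alpine:3.12.0
    resources:
      requests:
        cpu: 100m
        memory: 64Mi
      limits:
        cpu: 500m
        memory: 128Mi
sidecars:
  - image: alpine:3.12.0
    resources:
      requests:
        cpu: 100m
        memory: 64Mi
      limits:
        cpu: 500m
        memory: 128Mi
```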
Steps to Reproduce the Problem
workspace-in-sidecar example failed