
workspace-in-sidecar example test is very flakey #4169

Closed
ghost opened this issue Aug 17, 2021 · 18 comments
Labels
help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. kind/bug Categorizes issue or PR as related to a bug. kind/flake Categorizes issue or PR as related to a flakey test lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete.

Comments

@ghost

ghost commented Aug 17, 2021

Expected Behavior

Examples should run reasonably consistently from PR to PR.

Actual Behavior

I estimate that the workspace-in-sidecar example fails about half the time I initiate an integration run in our CI. It just doesn't seem to complete successfully on a regular basis. It's unclear why, but here's the error log from the most recent run that failed for me:

    build_logs.go:35: Could not get logs for pod workspace-in-sidecar-phrks-pod-x95zf: container "step-unnamed-0" in pod "workspace-in-sidecar-phrks-pod-x95zf" is waiting to start: PodInitializing

Hunch: maybe setting explicit resource requirements for the containers in the example would help?
^ This unfortunately did not help after all.
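
For reference, a minimal sketch of what explicit resource requests/limits on the example's step and sidecar could look like; the values and the trivial scripts here are illustrative and not taken from the actual example:

    apiVersion: tekton.dev/v1beta1
    kind: TaskRun
    metadata:
      generateName: workspace-in-sidecar-
    spec:
      taskSpec:
        steps:
          - image: alpine:3.12.0
            script: |
              #!/usr/bin/env sh
              echo "step running"
            resources:            # explicit requests/limits for the step container
              requests:
                cpu: 100m
                memory: 64Mi
              limits:
                cpu: 500m
                memory: 128Mi
        sidecars:
          - name: sidecar
            image: alpine:3.12.0
            script: |
              #!/usr/bin/env sh
              echo "sidecar running"
            resources:            # and for the sidecar container
              requests:
                cpu: 100m
                memory: 64Mi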

Steps to Reproduce the Problem

  1. Open a PR and let the integration suites run
  2. Check if workspace-in-sidecar example failed
  3. Re-run the test suites if the example passed
@ghost ghost added the kind/bug Categorizes issue or PR as related to a bug. label Aug 17, 2021
@ghost
Author

ghost commented Aug 18, 2021

Here's a snippet of the pod yaml from the most recent failure I saw:

        status:
          conditions:
          - lastProbeTime: null
            lastTransitionTime: "2021-08-17T23:41:18Z"
            status: "True"
            type: Initialized
          - lastProbeTime: null
            lastTransitionTime: "2021-08-17T23:41:12Z"
            message: 'containers with unready status: [step-unnamed-0 sidecar]'
            reason: ContainersNotReady
            status: "False"
            type: Ready
          - lastProbeTime: null
            lastTransitionTime: "2021-08-17T23:41:12Z"
            message: 'containers with unready status: [step-unnamed-0 sidecar]'
            reason: ContainersNotReady
            status: "False"
            type: ContainersReady
          - lastProbeTime: null
            lastTransitionTime: "2021-08-17T23:41:12Z"
            status: "True"
            type: PodScheduled
          containerStatuses:
          - image: alpine:3.12.0
            imageID: ""
            lastState: {}
            name: sidecar
            ready: false
            restartCount: 0
            started: false
            state:
              waiting:
                reason: PodInitializing
          - image: alpine:3.12.0
            imageID: ""
            lastState: {}
            name: step-unnamed-0
            ready: false
            restartCount: 0
            started: false
            state:
              waiting:
                reason: PodInitializing
          hostIP: 10.128.0.3
          initContainerStatuses:
          - containerID: docker://62d43b4aa1f1792dbccabfa9833abbbae7dfd8c94018e0d812af84d09656b153
            image: sha256:1da710f4bbbdfadf5c0f8ac1214f695105d05ea87c3f8c5795287c5bb5f50048
            imageID: docker-pullable://gcr.io/tekton-prow-13/tpipeline-e2e-img/entrypoint-bff0a22da108bc2f16c818c97641a296@sha256:4125d5f93bbd74bc178ff5c29555b3cd4ad3f11ac39e38852400224d9a3f52df
            lastState: {}
            name: place-tools
            ready: true
            restartCount: 0
            state:
              terminated:
                containerID: docker://62d43b4aa1f1792dbccabfa9833abbbae7dfd8c94018e0d812af84d09656b153
                exitCode: 0
                finishedAt: "2021-08-17T23:41:16Z"
                reason: Completed
                startedAt: "2021-08-17T23:41:16Z"
          - containerID: docker://de1044deea32e4e2fc38601f9b1cc209a9cf9ad925127ae372e3720a2fdd106f
            image: sha256:ac2afeee8f96767b2362a686e98503c9e1d1d24375be9896a0109b80e9cab631
            imageID: docker-pullable://gcr.io/distroless/base@sha256:aa4fd987555ea10e1a4ec8765da8158b5ffdfef1e72da512c7ede509bc9966c4
            lastState: {}
            name: place-scripts
            ready: true
            restartCount: 0
            state:
              terminated:
                containerID: docker://de1044deea32e4e2fc38601f9b1cc209a9cf9ad925127ae372e3720a2fdd106f
                exitCode: 0
                finishedAt: "2021-08-17T23:41:17Z"
                reason: Completed
                startedAt: "2021-08-17T23:41:17Z"
          phase: Pending

And the taskrun YAML:

        status:
          completionTime: "2021-08-17T23:42:12Z"
          conditions:
          - lastTransitionTime: "2021-08-17T23:42:12Z"
            message: TaskRun "workspace-in-sidecar-phrks" failed to finish within "1m0s"
            reason: TaskRunTimeout
            status: "False"
            type: Succeeded
          podName: workspace-in-sidecar-phrks-pod-x95zf
          startTime: "2021-08-17T23:41:12Z"
          steps:
          - container: step-unnamed-0
            name: unnamed-0
            terminated:
              exitCode: 1
              finishedAt: "2021-08-17T23:42:12Z"
              reason: TaskRunTimeout
              startedAt: null

So it looks like 1m0s wasn't a long enough timeout, but the init containers were done within about 6 seconds of booting up, which leaves roughly 54 seconds where the pod was just spinning its wheels in a PodInitializing state.
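
For context, that 1m0s is the TaskRun-level timeout that was in effect for these runs. A minimal sketch of how a longer timeout would be declared on a TaskRun; the 2m value and the placeholder step are illustrative only:

    apiVersion: tekton.dev/v1beta1
    kind: TaskRun
    metadata:
      generateName: workspace-in-sidecar-
    spec:
      timeout: 2m          # illustrative value; the runs above timed out at "1m0s"
      taskSpec:
        steps:
          - image: alpine:3.12.0
            script: |
              #!/usr/bin/env sh
              echo "task body goes here"   # placeholder step, not the example's real script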

@bobcatfish bobcatfish added kind/flake Categorizes issue or PR as related to a flakey test good first issue Denotes an issue ready for a new contributor, according to the "help wanted" guidelines. priority/backlog Higher priority than priority/awaiting-more-evidence. labels Oct 13, 2021
@jerop
Member

jerop commented Oct 19, 2021

a recent failure was observed in https://tekton-releases.appspot.com/build/tekton-prow/pr-logs/pull/tektoncd_pipeline/4322/pull-tekton-pipeline-alpha-integration-tests/1450527643072991232/

snippet of the pod yaml from the failure:

        status:
          conditions:
          - lastProbeTime: null
            lastTransitionTime: "2021-10-19T18:40:35Z"
            status: "True"
            type: Initialized
          - lastProbeTime: null
            lastTransitionTime: "2021-10-19T18:41:26Z"
            status: "True"
            type: Ready
          - lastProbeTime: null
            lastTransitionTime: "2021-10-19T18:41:26Z"
            status: "True"
            type: ContainersReady
          - lastProbeTime: null
            lastTransitionTime: "2021-10-19T18:40:29Z"
            status: "True"
            type: PodScheduled
          containerStatuses:
          - containerID: docker://123efdc25e006661013b3539a8991f864c6ddb0c534c26a34b7991febb34c808
            image: alpine:3.12.0
            imageID: docker-pullable://alpine@sha256:185518070891758909c9f839cf4ca393ee977ac378609f700f60a771a2dfe321
            lastState: {}
            name: sidecar
            ready: true
            restartCount: 0
            started: true
            state:
              running:
                startedAt: "2021-10-19T18:41:26Z"
          - containerID: docker://92936a2a5e44b9322c7ee2a1d4bb21effc930e3ba9a8cc938206c72fb61cb0e2
            image: alpine:3.12.0
            imageID: docker-pullable://alpine@sha256:185518070891758909c9f839cf4ca393ee977ac378609f700f60a771a2dfe321
            lastState: {}
            name: step-unnamed-0
            ready: true
            restartCount: 0
            started: true
            state:
              running:
                startedAt: "2021-10-19T18:41:26Z"
          hostIP: 10.128.0.2
          initContainerStatuses:
          - containerID: docker://cb458054adb536e1bed3dc169e236e12b2010de15dc1109da3e39b3781581b53
            image: sha256:ac3e8b91dabeb5702e2e47d26fc1102a382279faa9cbb26ad08db2e97c1c021e
            imageID: docker-pullable://gcr.io/tekton-prow-1/tpipeline-e2e-img/entrypoint-bff0a22da108bc2f16c818c97641a296@sha256:e96742b774189b453add683ebc68df229b70794f82219300adee53413c01f69a
            lastState: {}
            name: place-tools
            ready: true
            restartCount: 0
            state:
              terminated:
                containerID: docker://cb458054adb536e1bed3dc169e236e12b2010de15dc1109da3e39b3781581b53
                exitCode: 0
                finishedAt: "2021-10-19T18:40:33Z"
                reason: Completed
                startedAt: "2021-10-19T18:40:33Z"
          - containerID: docker://5d4c347e735da93b1514a3e003e365ef6b558af7b4504eb54a0711a9b7826101
            image: sha256:ac2afeee8f96767b2362a686e98503c9e1d1d24375be9896a0109b80e9cab631
            imageID: docker-pullable://gcr.io/distroless/base@sha256:aa4fd987555ea10e1a4ec8765da8158b5ffdfef1e72da512c7ede509bc9966c4
            lastState: {}
            name: place-scripts
            ready: true
            restartCount: 0
            state:
              terminated:
                containerID: docker://5d4c347e735da93b1514a3e003e365ef6b558af7b4504eb54a0711a9b7826101
                exitCode: 0
                finishedAt: "2021-10-19T18:40:35Z"
                reason: Completed
                startedAt: "2021-10-19T18:40:35Z"
          phase: Running

snippet of the taskrun yaml from the failure:

        status:
          completionTime: "2021-10-19T18:41:29Z"
          conditions:
          - lastTransitionTime: "2021-10-19T18:41:29Z"
            message: TaskRun "workspace-in-sidecar-rtmjf" failed to finish within "1m0s"
            reason: TaskRunTimeout
            status: "False"
            type: Succeeded
          podName: workspace-in-sidecar-rtmjf-pod-hj2bc
          startTime: "2021-10-19T18:40:29Z"
          steps:
          - container: step-unnamed-0
            imageID: docker-pullable://alpine@sha256:185518070891758909c9f839cf4ca393ee977ac378609f700f60a771a2dfe321
            name: unnamed-0
            terminated:
              exitCode: 1
              finishedAt: "2021-10-19T18:41:29Z"
              reason: TaskRunTimeout
              startedAt: "2021-10-19T18:41:26Z"

@ghost
Author

ghost commented Nov 2, 2021

Under Events I also notice the following errors related to the emptyDir volume that the sidecar and step are supposed to share:

arendelle-4mlqr    4m26s       Warning   Failed                         pod/workspace-in-sidecar-phrks-pod-x95zf                                  Error: cannot find volume "ws-76s6x" to mount into container "step-unnamed-0"
arendelle-4mlqr    4m26s       Normal    Pulled                         pod/workspace-in-sidecar-phrks-pod-x95zf                                  Container image "alpine:3.12.0" already present on machine
arendelle-4mlqr    4m26s       Warning   Failed                         pod/workspace-in-sidecar-phrks-pod-x95zf                                  Error: cannot find volume "ws-76s6x" to mount into container "sidecar"

Edit: looks like that specific warning comes from the kubelet when a volume definition hasn't been supplied for a volume mount name. The pod spec for workspace-in-sidecar-phrks-pod-x95zf does include a volume named ws-76s6x, however, so there must be something else going wrong here.
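
For context, the ws-76s6x volume is the generated pod volume that backs the example's workspace, which is bound to an emptyDir at the TaskRun level; roughly this shape (the workspace name "signals" is illustrative):

    # TaskRun-level workspace binding (sketch). Tekton generates a pod volume
    # with a name like "ws-76s6x" for this binding and mounts it into the
    # step and sidecar containers.
    spec:
      workspaces:
        - name: signals      # illustrative workspace name
          emptyDir: {}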

@bobcatfish
Collaborator

especially weird that it wouldn't find the volume definition when we're using emptyDir! I thought maybe a PVC hadn't been created, but how is it possible for the emptyDir volume to not be found??

tekton-robot pushed a commit that referenced this issue Dec 2, 2021
The workspace-in-sidecar example has been a relatively
consistent flake in our CI suite. The issue #4169
was created in August describing this problem and
there have been relatively regular reports related to the
example since then. To a first approximation I've found 31
individual instances of the test's failure since May across
24 separate Pull Requests.

Attempts to reduce the flakiness of this example have not
been successful yet. So in order to reduce the amount of noise
in our CI (and hopefully increase overall trust in the suite)
this commit moves the example into the no-ci folder, preventing
its execution, while we continue to try and figure out why it
is so flakey.
@bobcatfish bobcatfish removed the good first issue Denotes an issue ready for a new contributor, according to the "help wanted" guidelines. label Dec 16, 2021
@jerop jerop added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Feb 7, 2022
@lbernick lbernick moved this to Todo in Pipelines V1 Feb 22, 2022
@xchapter7x
Contributor

Once PR 4634 is merged, this flake should be resolved.

@vdemeester
Member

/close

@tekton-robot
Collaborator

@vdemeester: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Repository owner moved this from Todo to Done in Pipelines V1 Mar 2, 2022
@ghost
Author

ghost commented Mar 2, 2022

/reopen

It looks like this flaked again on #4596

The test run logs are here: https://tekton-releases.appspot.com/build/tekton-prow/pr-logs/pull/tektoncd_pipeline/4596/pull-tekton-pipeline-alpha-integration-tests/1499073528324952067/

@tekton-robot tekton-robot reopened this Mar 2, 2022
@tekton-robot
Collaborator

@sbwsg: Reopened this issue.

In response to this:

/reopen

It looks like this flaked again on #4596

The test run logs are here: https://tekton-releases.appspot.com/build/tekton-prow/pr-logs/pull/tektoncd_pipeline/4596/pull-tekton-pipeline-alpha-integration-tests/1499073528324952067/

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ghost
Author

ghost commented Mar 3, 2022

From the logs in that test run we see the following:

    panic.go:632: #####################################################
    panic.go:632: ### Dumping logs from Pods in the arendelle-n7xcb ###
    panic.go:632: #####################################################
    build_logs.go:37: build logs 
        >>> Container step-unnamed-0:
        
        >>> Container sidecar:
        Polling channel...
        Polling channel...
        Polling channel...
        Polling channel...
        Polling channel...
        Polling channel...
        Polling channel...
        Polling channel...
        Polling channel...
        Polling channel...

So what we're seeing here is that the Step never created the fifo special file used for communicating between the containers. Either that, or the sidecar couldn't see it for some reason.
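
For anyone reading along, the handshake in the example is roughly: the step creates a named pipe (fifo) on the shared workspace and writes to it, while the sidecar polls until it can read from it, hence the "Polling channel..." lines above. A rough sketch of that shape, not the example verbatim, with illustrative names:

    spec:
      workspaces:
        - name: signals                  # illustrative; bound to an emptyDir
          emptyDir: {}
      taskSpec:
        workspaces:
          - name: signals
        steps:
          - image: alpine:3.12.0
            script: |
              #!/usr/bin/env sh
              pipe="$(workspaces.signals.path)/fifo"
              mkfifo "$pipe" || echo "mkfifo failed"  # a debug echo here would confirm the script ran
              echo ready > "$pipe"                    # blocks until a reader opens the pipe
        sidecars:
          - name: sidecar
            image: alpine:3.12.0
            volumeMounts:
              - name: $(workspaces.signals.volume)    # mount the workspace's backing volume by name
                mountPath: /signals
            script: |
              #!/usr/bin/env sh
              while [ ! -p /signals/fifo ]; do
                echo "Polling channel..."
                sleep 1
              done
              cat /signals/fifo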

Looking at the container statuses it looks like both were ready and started at the same time:

          containerStatuses:
          - containerID: docker://d37e381690e0848177f126e3c85600adcf0d543ea87030389be710a4e0f7dc06
            image: alpine:3.12.0
            imageID: docker-pullable://alpine@sha256:185518070891758909c9f839cf4ca393ee977ac378609f700f60a771a2dfe321
            lastState: {}
            name: sidecar
            ready: true
            restartCount: 0
            started: true
            state:
              running:
                startedAt: "2022-03-02T17:45:33Z"
          - containerID: docker://8da5c68a979d2503c2e5c950df756171fcf63a3cec89e6139343a93633a80c2b
            image: alpine:3.12.0
            imageID: docker-pullable://alpine@sha256:185518070891758909c9f839cf4ca393ee977ac378609f700f60a771a2dfe321
            lastState: {}
            name: step-unnamed-0
            ready: true
            restartCount: 0
            started: true
            state:
              running:
                startedAt: "2022-03-02T17:45:33Z"

The TaskRun's start time is recorded as 2022-03-02T17:45:12Z and the pod's creation time as 2022-03-02T17:45:12Z. So it looks like it took about 20 seconds from pod creation to step + sidecar container starts. Is it possible that it then took > 40 seconds for the step to create the fifo file? Or that ... like ... (waves hands) ... the processes aren't being given enough CPU cycles to run in parallel ... or something?

@xchapter7x
Contributor

xchapter7x commented Mar 3, 2022

yeah, it's def strange. (abundant hand waving to come 👋) 😄
The mkfifo call should at least error if it was called and failed, and we should see the subsequent debug echoes, but we aren't seeing anything in the logs from the step container. That seems to indicate either (A) the script is never getting called due to some resource constraint or something, as you suggest, or (B) mkfifo is locked when it's called.

To check (A), we could add another debug echo before the mkfifo call, to confirm whether the script itself is even getting called.

To check (B), the only thing I can think of is that, if some kind of race condition is causing a lock in our use of mkfifo, then maybe switching the creation of the named pipe to the sidecar would help. Reason being the sidecar is created before the step container, right? That could better ensure the pipe is created before we start reading/writing/checking it.
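
A minimal sketch of what (A) and (B) could look like on the sidecar side, assuming the same illustrative workspace/volume layout as the sketch earlier in the thread; the step would then wait for the pipe to appear before writing:

    sidecars:
      - name: sidecar
        image: alpine:3.12.0
        volumeMounts:
          - name: $(workspaces.signals.volume)
            mountPath: /signals
        script: |
          #!/usr/bin/env sh
          echo "sidecar starting"                                  # (A) debug echo up front
          mkfifo /signals/fifo || echo "mkfifo failed in sidecar"  # (B) pipe created by the sidecar
          cat /signals/fifo                                        # blocks until the step writes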

@abayer
Contributor

abayer commented Apr 19, 2022

I've tried to reproduce this, and fail every time. The failures I've seen in CI have all been container "step-unnamed-0" in pod "workspace-in-sidecar-vpnrg-pod" is waiting to start: PodInitializing (with different pod names, of course). I'm wondering if tearDown for e2e tests should also do a kubectl describe pod when it can't get logs for a pod due to PodInitializing?

@dibyom dibyom added priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. and removed priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. priority/backlog Higher priority than priority/awaiting-more-evidence. labels Apr 25, 2022
@abayer
Contributor

abayer commented May 17, 2022

I'm now fairly convinced that sidecar tests are just particularly vulnerable to limited resources on the cluster, and are therefore more likely to just take too long or otherwise flake out. Hopefully #4871 will improve this a bit by decreasing the number of e2e tests run in parallel from 16 to 4.

@tekton-robot
Collaborator

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 15, 2022
@abayer
Contributor

abayer commented Aug 15, 2022

/lifecycle frozen

@tekton-robot tekton-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Aug 15, 2022
@afrittoli afrittoli added the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label Oct 31, 2023
@vdemeester
Member

I am going to close this as it doesn't seem to flake much anymore, and I think a lot of PRs went in to fix this issue one way or another.

Projects
Status: Done
Development

No branches or pull requests

8 participants