workspace-in-sidecar example test is very flakey #4169
Here's a snippet of the pod yaml from the most recent failure I saw:
status:
conditions:
- lastProbeTime: null
lastTransitionTime: "2021-08-17T23:41:18Z"
status: "True"
type: Initialized
- lastProbeTime: null
lastTransitionTime: "2021-08-17T23:41:12Z"
message: 'containers with unready status: [step-unnamed-0 sidecar]'
reason: ContainersNotReady
status: "False"
type: Ready
- lastProbeTime: null
lastTransitionTime: "2021-08-17T23:41:12Z"
message: 'containers with unready status: [step-unnamed-0 sidecar]'
reason: ContainersNotReady
status: "False"
type: ContainersReady
- lastProbeTime: null
lastTransitionTime: "2021-08-17T23:41:12Z"
status: "True"
type: PodScheduled
containerStatuses:
- image: alpine:3.12.0
imageID: ""
lastState: {}
name: sidecar
ready: false
restartCount: 0
started: false
state:
waiting:
reason: PodInitializing
- image: alpine:3.12.0
imageID: ""
lastState: {}
name: step-unnamed-0
ready: false
restartCount: 0
started: false
state:
waiting:
reason: PodInitializing
hostIP: 10.128.0.3
initContainerStatuses:
- containerID: docker://62d43b4aa1f1792dbccabfa9833abbbae7dfd8c94018e0d812af84d09656b153
image: sha256:1da710f4bbbdfadf5c0f8ac1214f695105d05ea87c3f8c5795287c5bb5f50048
imageID: docker-pullable://gcr.io/tekton-prow-13/tpipeline-e2e-img/entrypoint-bff0a22da108bc2f16c818c97641a296@sha256:4125d5f93bbd74bc178ff5c29555b3cd4ad3f11ac39e38852400224d9a3f52df
lastState: {}
name: place-tools
ready: true
restartCount: 0
state:
terminated:
containerID: docker://62d43b4aa1f1792dbccabfa9833abbbae7dfd8c94018e0d812af84d09656b153
exitCode: 0
finishedAt: "2021-08-17T23:41:16Z"
reason: Completed
startedAt: "2021-08-17T23:41:16Z"
- containerID: docker://de1044deea32e4e2fc38601f9b1cc209a9cf9ad925127ae372e3720a2fdd106f
image: sha256:ac2afeee8f96767b2362a686e98503c9e1d1d24375be9896a0109b80e9cab631
imageID: docker-pullable://gcr.io/distroless/base@sha256:aa4fd987555ea10e1a4ec8765da8158b5ffdfef1e72da512c7ede509bc9966c4
lastState: {}
name: place-scripts
ready: true
restartCount: 0
state:
terminated:
containerID: docker://de1044deea32e4e2fc38601f9b1cc209a9cf9ad925127ae372e3720a2fdd106f
exitCode: 0
finishedAt: "2021-08-17T23:41:17Z"
reason: Completed
startedAt: "2021-08-17T23:41:17Z"
phase: Pending
And the taskrun YAML:
status:
completionTime: "2021-08-17T23:42:12Z"
conditions:
- lastTransitionTime: "2021-08-17T23:42:12Z"
message: TaskRun "workspace-in-sidecar-phrks" failed to finish within "1m0s"
reason: TaskRunTimeout
status: "False"
type: Succeeded
podName: workspace-in-sidecar-phrks-pod-x95zf
startTime: "2021-08-17T23:41:12Z"
steps:
- container: step-unnamed-0
name: unnamed-0
terminated:
exitCode: 1
finishedAt: "2021-08-17T23:42:12Z"
reason: TaskRunTimeout
startedAt: null
So it looks like |
A recent failure was observed in https://tekton-releases.appspot.com/build/tekton-prow/pr-logs/pull/tektoncd_pipeline/4322/pull-tekton-pipeline-alpha-integration-tests/1450527643072991232/
Snippet of the pod yaml from the failure:
status:
conditions:
- lastProbeTime: null
lastTransitionTime: "2021-10-19T18:40:35Z"
status: "True"
type: Initialized
- lastProbeTime: null
lastTransitionTime: "2021-10-19T18:41:26Z"
status: "True"
type: Ready
- lastProbeTime: null
lastTransitionTime: "2021-10-19T18:41:26Z"
status: "True"
type: ContainersReady
- lastProbeTime: null
lastTransitionTime: "2021-10-19T18:40:29Z"
status: "True"
type: PodScheduled
containerStatuses:
- containerID: docker://123efdc25e006661013b3539a8991f864c6ddb0c534c26a34b7991febb34c808
image: alpine:3.12.0
imageID: docker-pullable://alpine@sha256:185518070891758909c9f839cf4ca393ee977ac378609f700f60a771a2dfe321
lastState: {}
name: sidecar
ready: true
restartCount: 0
started: true
state:
running:
startedAt: "2021-10-19T18:41:26Z"
- containerID: docker://92936a2a5e44b9322c7ee2a1d4bb21effc930e3ba9a8cc938206c72fb61cb0e2
image: alpine:3.12.0
imageID: docker-pullable://alpine@sha256:185518070891758909c9f839cf4ca393ee977ac378609f700f60a771a2dfe321
lastState: {}
name: step-unnamed-0
ready: true
restartCount: 0
started: true
state:
running:
startedAt: "2021-10-19T18:41:26Z"
hostIP: 10.128.0.2
initContainerStatuses:
- containerID: docker://cb458054adb536e1bed3dc169e236e12b2010de15dc1109da3e39b3781581b53
image: sha256:ac3e8b91dabeb5702e2e47d26fc1102a382279faa9cbb26ad08db2e97c1c021e
imageID: docker-pullable://gcr.io/tekton-prow-1/tpipeline-e2e-img/entrypoint-bff0a22da108bc2f16c818c97641a296@sha256:e96742b774189b453add683ebc68df229b70794f82219300adee53413c01f69a
lastState: {}
name: place-tools
ready: true
restartCount: 0
state:
terminated:
containerID: docker://cb458054adb536e1bed3dc169e236e12b2010de15dc1109da3e39b3781581b53
exitCode: 0
finishedAt: "2021-10-19T18:40:33Z"
reason: Completed
startedAt: "2021-10-19T18:40:33Z"
- containerID: docker://5d4c347e735da93b1514a3e003e365ef6b558af7b4504eb54a0711a9b7826101
image: sha256:ac2afeee8f96767b2362a686e98503c9e1d1d24375be9896a0109b80e9cab631
imageID: docker-pullable://gcr.io/distroless/base@sha256:aa4fd987555ea10e1a4ec8765da8158b5ffdfef1e72da512c7ede509bc9966c4
lastState: {}
name: place-scripts
ready: true
restartCount: 0
state:
terminated:
containerID: docker://5d4c347e735da93b1514a3e003e365ef6b558af7b4504eb54a0711a9b7826101
exitCode: 0
finishedAt: "2021-10-19T18:40:35Z"
reason: Completed
startedAt: "2021-10-19T18:40:35Z"
phase: Running
Snippet of the taskrun yaml from the failure:
status:
completionTime: "2021-10-19T18:41:29Z"
conditions:
- lastTransitionTime: "2021-10-19T18:41:29Z"
message: TaskRun "workspace-in-sidecar-rtmjf" failed to finish within "1m0s"
reason: TaskRunTimeout
status: "False"
type: Succeeded
podName: workspace-in-sidecar-rtmjf-pod-hj2bc
startTime: "2021-10-19T18:40:29Z"
steps:
- container: step-unnamed-0
imageID: docker-pullable://alpine@sha256:185518070891758909c9f839cf4ca393ee977ac378609f700f60a771a2dfe321
name: unnamed-0
terminated:
exitCode: 1
finishedAt: "2021-10-19T18:41:29Z"
reason: TaskRunTimeout
startedAt: "2021-10-19T18:41:26Z" |
Under
Edit: - looks like that specific warning comes from the kubelet when a volume definition hasn't been supplied for a volume mount name. The pod spec for |
Especially weird that it wouldn't find the volume definition when we're using emptyDir! I thought maybe a PVC hadn't been created, but how is it possible for the emptyDir volume to not be found??
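For context, here's a rough sketch of how an emptyDir-backed workspace should end up wired into the generated pod: one volume definition plus matching volumeMounts in both the step and sidecar containers. The names, mount path, and commands below are illustrative placeholders rather than what Tekton actually generates, but the kubelet warning discussed above is what appears when a volumeMount's name has no corresponding volumes entry.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: workspace-in-sidecar-sketch   # hypothetical name, not the generated pod name
spec:
  volumes:
    # the emptyDir volume the workspace binding should produce
    - name: ws-shared
      emptyDir: {}
  containers:
    - name: step-unnamed-0
      image: alpine:3.12.0
      command: ["sleep", "60"]
      volumeMounts:
        - name: ws-shared          # must match a volumes[].name entry; if it doesn't,
          mountPath: /workspace    # the kubelet reports the volume as not found
    - name: sidecar
      image: alpine:3.12.0
      command: ["sleep", "60"]
      volumeMounts:
        - name: ws-shared
          mountPath: /workspace
```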
The workspace-in-sidecar example has been a relatively consistent flake in our CI suite. The issue #4169 was created in August describing this problem and there have been relatively regular reports related to the example since then. To a first approximation I've found 31 individual instances of the test's failure since May across 24 separate Pull Requests. Attempts to reduce the flakiness of this example have not been successful yet. So in order to reduce the amount of noise in our CI (and hopefully increase overall trust in the suite) this commit moves the example into the no-ci folder, preventing its execution, while we continue to try and figure out why it is so flakey.
Once PR 4634 is merged, this flake should be resolved.
/close
@vdemeester: Closing this issue.
/reopen
It looks like this flaked again on #4596. The test run logs are here: https://tekton-releases.appspot.com/build/tekton-prow/pr-logs/pull/tektoncd_pipeline/4596/pull-tekton-pipeline-alpha-integration-tests/1499073528324952067/
@sbwsg: Reopened this issue.
From the logs in that test run we see the following:
So what we're seeing here is that the Step never created the fifo special file for communicating between the containers. Either that or the sidecar couldn't see the fifo special file for some reason. Looking at the container statuses it looks like both were ready and started at the same time:
The TaskRun's start time is recorded as |
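For anyone not familiar with the example, the handshake being discussed works roughly like the sketch below: the step creates a fifo in the shared emptyDir workspace and blocks reading from it, while the sidecar waits for the fifo to appear and then writes into it. This is a simplified stand-in, not the shipped example — the names, messages, and which side writes versus reads are assumptions here, and I'm assuming the workspace path variable resolves in sidecar scripts the same way it does in steps — but it shows why a step that never reaches mkfifo, or a sidecar that never sees the fifo, hangs until the TaskRun timeout.

```yaml
apiVersion: tekton.dev/v1beta1
kind: TaskRun
metadata:
  generateName: workspace-in-sidecar-sketch-   # hypothetical, not the real example
spec:
  workspaces:
    - name: signals
      emptyDir: {}
  taskSpec:
    workspaces:
      - name: signals
    steps:
      - image: alpine:3.12.0
        script: |
          #!/usr/bin/env sh
          # Create the fifo in the shared workspace, then block until the
          # sidecar writes a line into it.
          mkfifo $(workspaces.signals.path)/pipe
          read line < $(workspaces.signals.path)/pipe
          echo "step received: $line"
    sidecars:
      - image: alpine:3.12.0
        workspaces:
          - name: signals
        script: |
          #!/usr/bin/env sh
          # Wait for the step to create the fifo, then write to it so the
          # step's read can complete.
          while [ ! -p $(workspaces.signals.path)/pipe ]; do sleep 1; done
          echo "hello from the sidecar" > $(workspaces.signals.path)/pipe
```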
Yeah, it's definitely strange. (abundant hand waving to come 👋) 😄 To check (A), we could add another debug echo before the mkfifo call, to see whether the script itself is even getting called. To check (B), the only thing that I can think of is that, if it is some kind of race condition causing a lock in our use of |
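Concretely, checking (A) could be as small as the tweak sketched below to the step script from the hypothetical task above — an echo before and after the mkfifo call so the step logs show whether the script runs at all and what mkfifo returned (paths and messages are still placeholders):

```yaml
steps:
  - image: alpine:3.12.0
    script: |
      #!/usr/bin/env sh
      echo "step script started"                  # (A): proves the script was invoked at all
      mkfifo $(workspaces.signals.path)/pipe
      echo "mkfifo exited with status $?"         # did mkfifo complete, and how?
      read line < $(workspaces.signals.path)/pipe
```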
I've tried to reproduce this, and failed every time. The failures I've seen in CI have all been |
I'm now fairly convinced that sidecar tests are just particularly vulnerable to limited resources on the cluster, and are therefore more likely to just take too long or otherwise flake out. Hopefully #4871 will improve this a bit by decreasing the number of e2e tests run in parallel from 16 to 4.
Issues go stale after 90d of inactivity.
/lifecycle stale
Send feedback to tektoncd/plumbing.
/lifecycle frozen
I am going to close this as it doesn't seem to be as flaky anymore, and I think a lot of PRs went in to fix this issue one way or the other.
Expected Behavior
Examples should run reasonably consistently from PR to PR.
Actual Behavior
I estimate that the workspace-in-sidecar example fails about half the time I initiate an integration run in our CI. It just doesn't seem to be able to complete successfully on a regular basis. It's super unclear why this is, but here's the error log from the most recent run that failed for me:
Hunch: maybe setting explicit resource requirements for the containers in the example would help?
^ This unfortunately did not help after all.
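For the record, the resource-requirements hunch would have looked roughly like the fragment below, added to the example's step and sidecar (the values are made-up placeholders, and as noted above this didn't end up fixing the flake):

```yaml
steps:
  - image: alpine:3.12.0
    resources:
      requests:
        cpu: 100m
        memory: 64Mi
      limits:
        cpu: 500m
        memory: 128Mi
sidecars:
  - image: alpine:3.12.0
    resources:
      requests:
        cpu: 100m
        memory: 64Mi
      limits:
        cpu: 500m
        memory: 128Mi
```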
Steps to Reproduce the Problem
workspace-in-sidecar example failed