
process valueFrom in build strategy environment variables test context should successfully resolve valueFrom in s2i build environment variables #16994

Closed
0xmichalis opened this issue Oct 22, 2017 · 19 comments
Assignees
Labels
component/containers kind/test-flake lifecycle/rotten priority/P1 sig/containers

Comments

@0xmichalis
Contributor

/tmp/openshift/build-rpm-release/tito/rpmbuild-originsmcihc/BUILD/origin-3.7.0/_output/local/go/src/github.com/openshift/origin/test/extended/builds/valuefrom.go:77
Expected
    <bool>: false
to be true
/tmp/openshift/build-rpm-release/tito/rpmbuild-originsmcihc/BUILD/origin-3.7.0/_output/local/go/src/github.com/openshift/origin/test/extended/builds/valuefrom.go:66

https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/16915/test_pull_request_origin_extended_conformance_gce/10226/
/assign bparees
/kind test-flake

@bparees
Contributor

bparees commented Oct 22, 2017

Oct 22 19:50:23.636: INFO: 2017-10-22T19:50:17.942748000Z container_linux.go:247: starting container process caused "process_linux.go:258: applying cgroup configuration for process caused \"open /sys/fs/cgroup/pids/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod3a1c7d43_b762_11e7_8538_42010a8e0002.slice/docker-2ef45289f4558b73fc22ee000087862c4f6b82638d4c4e4c0141b2aee429a945.scope/cgroup.procs: no such file or directory\""

@jwhonce looks like containers team to me.

@bparees
Contributor

bparees commented Oct 22, 2017

@Kargakis any reason it's not p1?

@0xmichalis
Contributor Author

No, p1 sgtm if we use it as a means to draw attention.

@bparees
Contributor

bparees commented Oct 22, 2017

Attention, and the implication of its impact on others (by blocking the queue). And in this case it's also a pretty worrying error, unless we know why it would only occur in our test env.

@jim-minter
Contributor

@jwhonce I spent some time looking at this, somewhat in vain. The best I came up with is that wherever runc makes dbus calls to systemd, and especially at https://github.com/opencontainers/runc/blob/b2567b37d7b75eb4cf325b77297b140ea686ce8f/libcontainer/cgroups/systemd/apply_systemd.go#L291, it doesn't wait for systemd to reply via dbus that it has completed the task, so runc and systemd race. However, I wasn't able to work out how that race would cause this specific issue.

I'm reasonably convinced that the kernel ensures that the appearance of the cgroup.procs file in a cgroup directory is atomic with respect to the creation of the cgroup directory (no possible window between mkdir succeeding and a stat on the file failing). If that's the case, then I think there must be two racing userland processes. The recursive mkdir code used by Go and systemd looks robust to me in the face of racing recursive mkdirs. Alternatively, perhaps one process is doing a recursive mkdir (instances can be found in runc and systemd) and another an rmdir? But I couldn't see evidence of systemd doing that via strace, and although there is runc code that would rmdir that path, I don't see why or how it could be running concurrently.

Don't know if any of that helps at all? Is it worth comparing notes with someone on the systemd team?

It seemed like a lot of effort to pull a runc patch for the dbus race all the way through to OpenShift CI just on a hunch that it would solve this issue, so I didn't try. Perhaps more debugging code, or a simple retry added to runc, might be the best plan (possibly in addition to fixing the dbus race at the same time)?
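
Not runc's actual fix, just a minimal Go sketch (assuming go-systemd's dbus bindings; the unit name, property list, and timeout are illustrative) of the kind of wait described above: block on the completion channel that StartTransientUnit signals, instead of returning as soon as the job has been queued.

```go
package systemdwait

import (
	"fmt"
	"time"

	systemdDbus "github.com/coreos/go-systemd/dbus"
)

// startScopeAndWait asks systemd to create a transient unit and then waits
// for systemd to report the job as "done" before assuming the cgroup exists.
func startScopeAndWait(unitName string) error {
	conn, err := systemdDbus.New()
	if err != nil {
		return err
	}
	defer conn.Close()

	props := []systemdDbus.Property{
		systemdDbus.PropDescription("illustrative transient unit"),
	}

	// StartTransientUnit returns once the job is queued; completion is
	// signalled asynchronously on statusChan.
	statusChan := make(chan string, 1)
	if _, err := conn.StartTransientUnit(unitName, "replace", props, statusChan); err != nil {
		return err
	}

	select {
	case status := <-statusChan:
		if status != "done" {
			return fmt.Errorf("systemd job for %s finished with status %q", unitName, status)
		}
	case <-time.After(30 * time.Second):
		return fmt.Errorf("timed out waiting for systemd to set up %s", unitName)
	}
	return nil
}
```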

@runcom
Member

runcom commented Nov 30, 2017

@jim-minter we're seeing the very same race in CRI-O as well, using systemd and runc. I think we can add a stopgap in runc to spin until the file shows up for the pids cgroup; otherwise, I suspect it's a systemd issue, because the runc code hasn't changed that much over the past weeks.
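
A rough sketch of the stopgap described above (the retry count and delay are placeholders, not anything runc actually ships): poll for cgroup.procs to appear before trying to write the pid into it.

```go
package cgroupwait

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// waitForCgroupProcs spins until cgroup.procs exists in the given cgroup
// directory, as a stopgap for the window where the directory has been
// requested but cgroup.procs is not yet visible to the caller.
func waitForCgroupProcs(cgroupDir string) error {
	procsFile := filepath.Join(cgroupDir, "cgroup.procs")
	for i := 0; i < 50; i++ {
		_, err := os.Stat(procsFile)
		if err == nil {
			return nil // file is there; safe to write the pid into it
		}
		if !os.IsNotExist(err) {
			return err // some other error; don't keep spinning
		}
		time.Sleep(10 * time.Millisecond)
	}
	return fmt.Errorf("timed out waiting for %s to appear", procsFile)
}
```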

@jim-minter
Contributor

@vikaschoudhary16 I don't know whether comment #16994 (comment) is of interest, but I wanted to bring it to your attention.

@runcom
Member

runcom commented Nov 30, 2017

/cc @mrunalp

@jim-minter
Contributor

@vikaschoudhary16 I'm not really qualified to answer 😕 but I'll ask: if a missing m.mu.Lock() there is the root issue, can you explain how it causes the symptoms seen?

@vikaschoudhary16
Contributor

@jim-minter The failures are occurring in manager.Apply(), and it looks to me like that can happen only if a parallel manager.Destroy() gets invoked and eventually executes the rmdir. Then I noticed that said lock is missing in Apply(). If my guess about the scenario is correct, a lock in Apply() would hold Destroy() back until the cgroup join is complete.
But I'm still not sure exactly in what scenario the above would happen.

Here is the related PR: opencontainers/runc#1668
And for testing: cri-o/cri-o#1205
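
A simplified sketch of the locking described above (the type and field names are stand-ins, not runc's actual ones): take the manager mutex in Apply() so a concurrent Destroy() cannot rmdir the cgroup directories while the join is still in progress, which is the gist of opencontainers/runc#1668.

```go
package cgroupmgr

import "sync"

// Manager is a stand-in for runc's systemd cgroup manager.
type Manager struct {
	mu    sync.Mutex
	paths map[string]string // subsystem -> cgroup directory
}

// Apply creates the per-subsystem cgroup directories and joins pid to them.
func (m *Manager) Apply(pid int) error {
	m.mu.Lock() // the lock noted above as missing in Apply()
	defer m.mu.Unlock()
	// ... mkdir the per-subsystem directories, record them in m.paths,
	// and write pid into each cgroup.procs ...
	return nil
}

// Destroy removes the cgroup directories recorded by Apply.
func (m *Manager) Destroy() error {
	m.mu.Lock() // serializes with Apply so the rmdir cannot race the join
	defer m.mu.Unlock()
	// ... rmdir the directories in m.paths ...
	return nil
}
```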

@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot added the lifecycle/stale label Apr 26, 2018
@openshift-bot
Contributor

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label May 27, 2018
@openshift-bot
Contributor

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close
