
process valueFrom in build strategy environment variables test context should successfully resolve valueFrom in s2i build environment variables #16994

Closed
0xmichalis opened this issue Oct 22, 2017 · 19 comments
Assignees
Labels
component/containers kind/test-flake lifecycle/rotten priority/P1 sig/containers

Comments

@0xmichalis
Contributor

/tmp/openshift/build-rpm-release/tito/rpmbuild-originsmcihc/BUILD/origin-3.7.0/_output/local/go/src/github.com/openshift/origin/test/extended/builds/valuefrom.go:77
Expected
    <bool>: false
to be true
/tmp/openshift/build-rpm-release/tito/rpmbuild-originsmcihc/BUILD/origin-3.7.0/_output/local/go/src/github.com/openshift/origin/test/extended/builds/valuefrom.go:66

https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/16915/test_pull_request_origin_extended_conformance_gce/10226/
/assign bparees
/kind test-flake

@bparees
Contributor

bparees commented Oct 22, 2017

Oct 22 19:50:23.636: INFO: 2017-10-22T19:50:17.942748000Z container_linux.go:247: starting container process caused "process_linux.go:258: applying cgroup configuration for process caused \"open /sys/fs/cgroup/pids/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod3a1c7d43_b762_11e7_8538_42010a8e0002.slice/docker-2ef45289f4558b73fc22ee000087862c4f6b82638d4c4e4c0141b2aee429a945.scope/cgroup.procs: no such file or directory\""

@jwhonce looks like containers team to me.

@bparees
Contributor

bparees commented Oct 22, 2017

@Kargakis any reason it's not p1?

@0xmichalis
Contributor Author

No, p1 sgtm if we use it as a means to draw attention.

@bparees
Contributor

bparees commented Oct 22, 2017

Attention, and the implication of its impact on others (by blocking the queue). And in this case it's also a pretty worrying error, unless we know why it would only occur in our test env.

@jim-minter
Contributor

@jwhonce I spent some time looking at this, somewhat in vain. The best I came up with is that wherever runc makes dbus calls to systemd, and especially at https://github.com/opencontainers/runc/blob/b2567b37d7b75eb4cf325b77297b140ea686ce8f/libcontainer/cgroups/systemd/apply_systemd.go#L291, it doesn't wait for systemd to reply via dbus that it has completed the task, so runc and systemd race. However, I wasn't able to work out how that race would cause this specific issue.

I'm reasonably convinced that the kernel ensures that the appearance of the cgroup.procs file in a cgroup directory is atomic with respect to the creation of the cgroup directory (no possible window between mkdir succeeding and a stat on the file failing). If that's the case, then I think there must be two racing userland processes. The recursive mkdir code used by Go and systemd looks robust to me in the face of racing recursive mkdirs. Alternatively, perhaps one process is doing a recursive mkdir (instances can be found in runc and systemd) and another an rmdir? But I couldn't see evidence of systemd doing that via strace, and although there is runc code that would rmdir that path, I don't see why or how it could be running concurrently.

Don't know if any of that helps at all? Is it worth comparing notes with someone on the systemd team?

It seemed like a lot of effort to pull a runc patch for the dbus race all the way through to OpenShift CI just on a hunch that it would solve this issue, so I didn't try. Perhaps more debugging code, or a simple retry added to runc, might be the best plan (possibly in addition to fixing the dbus race at the same time)?
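
Not runc's actual fix, just a minimal Go sketch (assuming go-systemd's dbus bindings; the unit name, property list, and timeout are illustrative) of the kind of wait described above: block on the completion channel that StartTransientUnit signals, instead of returning as soon as the job has been queued.

```go
package systemdwait

import (
	"fmt"
	"time"

	systemdDbus "github.com/coreos/go-systemd/dbus"
)

// startScopeAndWait asks systemd to create a transient unit and then waits
// for systemd to report the job as "done" before assuming the cgroup exists.
func startScopeAndWait(unitName string) error {
	conn, err := systemdDbus.New()
	if err != nil {
		return err
	}
	defer conn.Close()

	props := []systemdDbus.Property{
		systemdDbus.PropDescription("illustrative transient unit"),
	}

	// StartTransientUnit returns once the job is queued; completion is
	// signalled asynchronously on statusChan.
	statusChan := make(chan string, 1)
	if _, err := conn.StartTransientUnit(unitName, "replace", props, statusChan); err != nil {
		return err
	}

	select {
	case status := <-statusChan:
		if status != "done" {
			return fmt.Errorf("systemd job for %s finished with status %q", unitName, status)
		}
	case <-time.After(30 * time.Second):
		return fmt.Errorf("timed out waiting for systemd to set up %s", unitName)
	}
	return nil
}
```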

@runcom
Member

runcom commented Nov 30, 2017

@jim-minter we're seeing the very same race in CRI-O as well, using systemd and runc. I think we can add a stopgap in runc to spin until the file shows up for the pids cgroup; otherwise, I suspect it's a systemd issue, because the runc code hasn't changed that much over the past weeks.
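
A rough sketch of the stopgap described above (the retry count and delay are placeholders, not anything runc actually ships): poll for cgroup.procs to appear before trying to write the pid into it.

```go
package cgroupwait

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// waitForCgroupProcs spins until cgroup.procs exists in the given cgroup
// directory, as a stopgap for the window where the directory has been
// requested but cgroup.procs is not yet visible to the caller.
func waitForCgroupProcs(cgroupDir string) error {
	procsFile := filepath.Join(cgroupDir, "cgroup.procs")
	for i := 0; i < 50; i++ {
		_, err := os.Stat(procsFile)
		if err == nil {
			return nil // file is there; safe to write the pid into it
		}
		if !os.IsNotExist(err) {
			return err // some other error; don't keep spinning
		}
		time.Sleep(10 * time.Millisecond)
	}
	return fmt.Errorf("timed out waiting for %s to appear", procsFile)
}
```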

@jim-minter
Contributor

@vikaschoudhary16 I don't know whether comment #16994 (comment) is of interest, but I wanted to bring it to your attention.

@runcom
Member

runcom commented Nov 30, 2017

/cc @mrunalp

@jim-minter
Contributor

@vikaschoudhary16 I'm not really qualified to answer 😕 but I'll ask: if a missing m.mu.Lock() there is the root issue, can you explain how it causes the symptoms seen?

@vikaschoudhary16
Contributor

@jim-minter The failures are occurring in manager.Apply(), and it looks to me like that can happen only if a parallel manager.Destroy() gets invoked and eventually executes the rmdir. Then I noticed that said lock is missing in Apply(). If my guess about the scenario is correct, a lock in Apply() would hold Destroy() back until the cgroup join is complete.
But I'm still not sure exactly in what scenario the above would happen.

Here is the related PR: opencontainers/runc#1668
And for testing: cri-o/cri-o#1205
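
A simplified sketch of the locking described above (the type and field names are stand-ins, not runc's actual ones): take the manager mutex in Apply() so a concurrent Destroy() cannot rmdir the cgroup directories while the join is still in progress, which is the gist of opencontainers/runc#1668.

```go
package cgroupmgr

import "sync"

// Manager is a stand-in for runc's systemd cgroup manager.
type Manager struct {
	mu    sync.Mutex
	paths map[string]string // subsystem -> cgroup directory
}

// Apply creates the per-subsystem cgroup directories and joins pid to them.
func (m *Manager) Apply(pid int) error {
	m.mu.Lock() // the lock noted above as missing in Apply()
	defer m.mu.Unlock()
	// ... mkdir the per-subsystem directories, record them in m.paths,
	// and write pid into each cgroup.procs ...
	return nil
}

// Destroy removes the cgroup directories recorded by Apply.
func (m *Manager) Destroy() error {
	m.mu.Lock() // serializes with Apply so the rmdir cannot race the join
	defer m.mu.Unlock()
	// ... rmdir the directories in m.paths ...
	return nil
}
```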

@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot added the lifecycle/stale label Apr 26, 2018
@openshift-bot
Contributor

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label May 27, 2018
@openshift-bot
Contributor

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close
