process valueFrom in build strategy environment variables test context should successfully resolve valueFrom in s2i build environment variables #16994
Comments
@jwhonce looks like containers team to me.
@Kargakis any reason it's not p1?
No, p1 sgtm if we use it as a means to draw attention.
Attention, and the implication of its impact on others (by blocking the queue). And in this case it's also a pretty worrying error, unless we know why it would only occur in our test env.
@jwhonce I spent some time looking at this, somewhat in vain. The best thing I came up with was that everywhere in runc where dbus calls to systemd are made, and especially at https://github.com/opencontainers/runc/blob/b2567b37d7b75eb4cf325b77297b140ea686ce8f/libcontainer/cgroups/systemd/apply_systemd.go#L291, runc doesn't wait for systemd to reply via dbus that it has completed the task, so runc and systemd race. However, I wasn't able to reason a way in which that race would cause this specific issue.

I'm reasonably convinced that the kernel ensures the appearance of the cgroup.procs file in a cgroup directory is atomic with respect to the creation of the cgroup directory (no possible window between mkdir succeeding and a stat on the file failing). If that's the case, then I think there must be two racing userland processes. The recursive mkdir code used by Go and systemd looks robust to me in the face of racing recursive mkdirs. Alternatively, perhaps one process is doing a recursive mkdir (instances can be found in runc and systemd) and another an rmdir? But I couldn't see evidence of systemd doing that via strace; there is runc code that would rmdir that path, but I don't see why or how it could be running concurrently.

Don't know if any of that helps at all? Is it worth comparing notes with someone on the systemd team? It seemed like a lot of effort to pull a runc patch for the dbus race all the way through to OpenShift CI just on a hunch that it would solve this issue, so I didn't try. Perhaps more debugging code, or just a simple retry added to runc, might be the best plan (possibly in addition to solving the dbus race at the same time)?
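For illustration only, here is a minimal Go sketch of the "wait for systemd to reply via dbus" idea described above, using the go-systemd dbus client's job-completion channel. The unit name, properties, and timeout are assumptions made for the example, not runc's actual code:

```go
// Sketch: start a transient systemd unit and block until systemd reports the
// job finished, instead of assuming the cgroup hierarchy already exists.
package main

import (
	"fmt"
	"os"
	"time"

	systemdDbus "github.com/coreos/go-systemd/dbus"
	godbus "github.com/godbus/dbus"
)

func startTransientScopeAndWait(unitName string, pid int) error {
	conn, err := systemdDbus.New()
	if err != nil {
		return err
	}
	defer conn.Close()

	props := []systemdDbus.Property{
		systemdDbus.PropDescription("sketch: transient scope for " + unitName),
		// Attach the target process to the new scope.
		{Name: "PIDs", Value: godbus.MakeVariant([]uint32{uint32(pid)})},
	}

	// StartTransientUnit reports the job result ("done", "failed", ...) on
	// this channel; waiting on it closes the window in which the caller
	// races ahead of systemd's cgroup setup.
	statusCh := make(chan string, 1)
	if _, err := conn.StartTransientUnit(unitName, "replace", props, statusCh); err != nil {
		return err
	}

	select {
	case status := <-statusCh:
		if status != "done" {
			return fmt.Errorf("transient unit %s: job result %q", unitName, status)
		}
	case <-time.After(10 * time.Second):
		return fmt.Errorf("timed out waiting for systemd to start %s", unitName)
	}
	return nil
}

func main() {
	if err := startTransientScopeAndWait("sketch-test.scope", os.Getpid()); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```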
@jim-minter we're seeing the very same race in CRI-O as well, using systemd and runc. I think we can add a stopgap in runc to spin until the file pops up for the pids cgroup; otherwise, I suspect it's a systemd issue, because the runc code hasn't changed that much over the weeks.
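As a rough illustration of that stopgap, a bounded spin on the cgroup.procs file could look something like the sketch below; the cgroup path and timeout are made-up values for the example, not what runc or CRI-O actually use:

```go
// Sketch: poll until the kernel-created cgroup.procs file becomes visible in
// a freshly created pids cgroup directory, instead of failing immediately.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// waitForCgroupProcs retries a stat on <cgroupPath>/cgroup.procs for up to
// maxWait, papering over any window in which another process (e.g. systemd)
// is still assembling the hierarchy.
func waitForCgroupProcs(cgroupPath string, maxWait time.Duration) error {
	procsFile := filepath.Join(cgroupPath, "cgroup.procs")
	deadline := time.Now().Add(maxWait)
	for {
		_, err := os.Stat(procsFile)
		if err == nil {
			return nil
		}
		if !os.IsNotExist(err) {
			return err
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("timed out waiting for %s to appear", procsFile)
		}
		time.Sleep(10 * time.Millisecond)
	}
}

func main() {
	// Illustrative path only; the real pids cgroup path depends on the
	// slice/scope layout runc is asked to use.
	if err := waitForCgroupProcs("/sys/fs/cgroup/pids/system.slice/example.scope", time.Second); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```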
CRI-O failed here https://ci.openshift.redhat.com/jenkins/job/test_pull_request_crio_e2e_rhel/433/consoleFull#199861745056cbb9a5e4b02b88ae8c2f77 for the record.
@vikaschoudhary16 don't know if comment #16994 (comment) is of interest but wanted to bring it to your attention.
/cc @mrunalp
@jim-minter Looks to me
@vikaschoudhary16 I'm not really qualified to answer 😕 but I'll ask: if a missing
@jim-minter Failures are occurring in
Here is the related PR: opencontainers/runc#1668
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle stale. If this issue is safe to close now please do so with /close. /lifecycle stale
Stale issues rot after 30d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle rotten. If this issue is safe to close now please do so with /close. /lifecycle rotten
Rotten issues close after 30d of inactivity. Reopen the issue by commenting /reopen. /close
https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/16915/test_pull_request_origin_extended_conformance_gce/10226/
/assign bparees
/kind test-flake