Skip to content

Conversation

@cyphar
Copy link
Member

@cyphar cyphar commented Jan 13, 2016

There are two sets of tests failing in Docker with the current state of runc. This PR solves the container states part by reverting it:

  • Containers don't seem to be killed properly, causing a bunch of tests using docker ps to fail. This was caused by @crosbymichael's commit fa24ebf26c0a0c26880bd8375b81df8f91a0bbb4.

The other issue is fixed by #495:

  • Updating a container doesn't work for the memory cgroup, causing docker update tests to fail. This was caused by my commit f36ed4b174bb7e8951db6b61e2202a45a7251e30.

We should merge both in order to vendor libcontainer into docker.

@cyphar cyphar changed the title Revert "Merge pull request #311 from crosbymichael/destory-state" fix failing Docker tests Jan 13, 2016
@cyphar
Copy link
Member Author

cyphar commented Jan 13, 2016

The tests now pass on Docker master (as evidenced in the current state of the testbed PR moby/moby#19250).

/ping @crosbymichael @mrunalp @dqminh @LK4D4 @jfrazelle

@LK4D4
Copy link
Contributor

LK4D4 commented Jan 13, 2016

LGTM
However, we need to find reproduction case for container destroy failure.

@cyphar
Copy link
Member Author

cyphar commented Jan 13, 2016

@LK4D4 Definitely. I think @estesp had some progress on this, but it shouldn't be too hard to root-cause which DockerSuite test case caused this issue, and then figure out what part of the state transition stuff broke it (just figure out what the container name is and work backwards).

@mrunalp
Copy link
Contributor

mrunalp commented Jan 13, 2016

I am okay with this but thinking whether triaging/fixing the issue is a better option. @crosbymichael WDYT?

@hqhq
Copy link
Contributor

hqhq commented Jan 14, 2016

I'm never fan of reverting big patch sets, since we still have more than two weeks bugfix window before 1.10, can we try to address the failed case and fix it?

@cyphar
Copy link
Member Author

cyphar commented Jan 14, 2016

@hqhq I'll do my best (some help from @crosbymichael would be nice, since he wrote it), but I'm a bit worried about the fact that we can't seem to reproduce this on the CLI (and the test failures are quite ... weird in nature).

Can we at least agree that if we don't solve this soon, we apply this quite a ways before 1.10 (quite a few PRs are waiting for the tests to be fixed -- including security fixes, the PIDs cgroup stuff, some userns stuff, and other things that are all tagged for the milestone).

@hqhq
Copy link
Contributor

hqhq commented Jan 14, 2016

Can we at least agree that if we don't solve this soon, we apply this quite a ways before 1.10

Sure.

@crosbymichael
Copy link
Member

Let me look into the failures and see if we can fix the issues instead of reverting the changes.

@marcosnils
Copy link
Contributor

@crosbymichael @cyphar I've just vendored latest runC master into Docker and I only get one test failing

----------------------------------------------------------------------
FAIL: docker_cli_events_unix_test.go:48: DockerSuite.TestEventsOOMDisableFalse

docker_cli_events_unix_test.go:61:
    c.Assert(err, checker.IsNil)
... value *errors.errorString = &errors.errorString{s:"wrong exit code for OOM container: expected 137, got 125 (output: \"WARNING: Your kernel does not support swap limit capabilities, memory limited without swap.\\ndocker: Error response from daemon: Cannot start container 19f650e9c606f8411fbc81908872ef1d5dc6ce1998f84951740806a0152aad43: [9] System error: read parent: connection reset by peer.\\n\")"} ("wrong exit code for OOM container: expected 137, got 125 (output: \"WARNING: Your kernel does not support swap limit capabilities, memory limited without swap.\\ndocker: Error response from daemon: Cannot start container 19f650e9c606f8411fbc81908872ef1d5dc6ce1998f84951740806a0152aad43: [9] System error: read parent: connection reset by peer.\\n\")")


----------------------------------------------------------------------

@cyphar is this the same error you were getting?. Because in your original PR I think I saw several errors.

@cyphar
Copy link
Member Author

cyphar commented Jan 15, 2016

@marcosnils No, that error is because your kernel doesn't support OOMDisable. It also looks like it doesn't support swap limiting as well. The errors we're talking about can be reproduced with the Jenkins CI (which supports all of the latest Docker-related kernel features other than the PIDs cgroup AFAIK).

There are several test errors (which are consistent on Jenkins) that are fixed with this PR. If you want to see what the errors look like, here's the console output: https://jenkins.dockerproject.org/job/Docker-PRs/22345/console. The ps related errors are caused by a container not being properly stopped / deleted. One other error related to memory is related to my cgroup changes.

@marcosnils
Copy link
Contributor

@cyphar I didn't get any errors when running the tests with the latest runC version. In fact, they all passed except for the OOM which seems to be a missing kernel feature as you mentioned.

Here's my output:

http://pastebin.com/TkuuwQJM

@cyphar
Copy link
Member Author

cyphar commented Jan 15, 2016

@marcosnils Are you sure? I'm fairly sure that this test simply must unambiguously fail (the code just doesn't exist in runc for it to work).

PASS: docker_cli_update_unix_test.go:15: DockerSuite.TestUpdateRunningContainer 0.454s

In any case, what we're testing is whether the test suite running on the "official" Janky servers. I can't reproduce the errors on my local machine either (but @LK4D4 can reproduce them on his machine), so I can't comment why you're not seeing the errors.

The fact that this test is skipped might have something to do with why you don't see one of the errors:

SKIP: docker_api_update_unix_test.go:12: DockerSuite.TestApiUpdateContainer (Test requires an environment that supports cgroup swap memory limit.)

@marcosnils
Copy link
Contributor

@cyphar thanks. I'm aware that tests need to pass on janky, just wanted to remark that some env stuff is going on here as tests do not fail consistently. 😄.

FWIW I ran them in a clean/stock ubuntu 14.04 env.

@cyphar
Copy link
Member Author

cyphar commented Jan 15, 2016

I'm going to try to re-run the tests on my local machine with 4.4.0 on Arch (custom-compiled to have all of the relevant kernel features). It'd be nice if @jfrazelle could give us the hudson.sh script so we can try to recreate the Jenkins environment locally.

@crosbymichael
Copy link
Member

@cyphar i can repo locally and will be working on this today

@jimmidyson
Copy link
Contributor

@crosbymichael How's this going? Do you think #476 fixed this for Docker?

@cyphar
Copy link
Member Author

cyphar commented Jan 20, 2016

@jimmidyson The bug we're investigating is related to state transition stuff. The resource pointer stuff was related to backwards compatibility with old configurations.

@jimmidyson
Copy link
Contributor

@cyphar OK was just a bit confused by the comment in moby/moby#19469 (comment) - sorry!

@cyphar
Copy link
Member Author

cyphar commented Jan 20, 2016

Ah, right. It took me a bit to figure out that that PR vendors a branch of runc. I think that's a very bad idea, and that's what my comment was about.

EDIT: It's now been merged. Brilliant. /s

@crosbymichael
Copy link
Member

@cyphar EDIT: It's now been merged. Brilliant. /s

No need to be rude just because things don't work the way you think they should.

@crosbymichael crosbymichael changed the title fix failing Docker tests Revert memcg set and container states Jan 21, 2016
@cyphar
Copy link
Member Author

cyphar commented Jan 21, 2016

@crosbymichael Sorry, I was in a bad mood at the time. Mainly, I don't like the fact that it'll make bisecting Docker to figure out what patches to backport much more annoying than it currently is.

However, I also think it's incredibly rude to not even acknowledge grievances (that are IMO valid) about a patch from a member of the community that's been working on Docker for quite a while.

I understand that Docker doesn't care about backporting fixes to older versions, so I understand why that was a valid compromise for you. But can you see the other side of the argument? Someone in a few months is going to be bisecting Docker and will hit a 50-patch window where hack/vendor.sh no longer works (hint: that person will almost certainly be me, or will be someone who doesn't know why hack/vendor.sh is referring to a commit that doesn't exist).

@cyphar cyphar changed the title Revert memcg set and container states Fix memcg set and revert container states Jan 21, 2016
@crosbymichael
Copy link
Member

@cyphar no problem we will get it fixed. I almost of the state stuff done but am willing to revert it but we should do your fix and the state stuff in two prs. right now there are other things broken in master, for the exec tests we are getting a file not found now.

We vendored a commit on a branch because it's not wise to vendor master right now while docker is in a code freeze so we are just cherry picking in bug fixes that are needed at the moment until we can get things working again. So even if we merged this, it is still a risky move to vendor all of libcontainer at the moment.

@crosbymichael
Copy link
Member

@cyphar can you split this PR up into two?

@cyphar
Copy link
Member Author

cyphar commented Jan 21, 2016

@crosbymichael Sure. I'll move the memcg fixes to another PR (#495).

This reverts commit fa24ebf, reversing
changes made to 5a398b5.

The Docker test suite fails if this set of commits are in runC, so
reverse them as a temporary measure until the underlying issue is
solved. It appears as though processes are not being killed properly.

Cc: Michael Crosby <[email protected]>
Signed-off-by: Aleksa Sarai <[email protected]>
@cyphar cyphar changed the title Fix memcg set and revert container states Revert container states Jan 21, 2016
@cyphar
Copy link
Member Author

cyphar commented Jan 25, 2016

Closing in favor of #499 (#495 has been merged).

@cyphar cyphar closed this Jan 25, 2016
@cyphar cyphar deleted the revert-container-states branch January 25, 2016 09:59
stefanberger pushed a commit to stefanberger/runc that referenced this pull request Sep 8, 2017
These are optional on multiple platforms and should be left up to the
runtime/host system for validation.

Closes opencontainers#470

Signed-off-by: Michael Crosby <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants