Attempt to mimic changes from upstream PR#27467 #212

nalind · 2016-10-28T20:17:59Z

Problem: if a process that's been started for 'docker exec' exits fast enough, the daemon can receive a 'process exited' update from containerd before it starts passing stdio data back and forth, and the process's output is lost.

Why it happens: the 'process exited' state change message is only processed after acquiring a lock for the container, and while most of the exec() setup is done while holding that lock, the lock is freed when libcontainerd calls back to set up the passing of stdio data.

Proposed change: call getExecConfig() before AddProcess(), and modify AddProcess() to take a callback that the daemon can hand it that remembers the exec.Config value that was retrieved.

The effect should be comparable to the changes in moby#27467 (in that we now hold the libcontainerd client lock, and make sure that we don't obtain the Container object's lock by calling getExecConfig()), but without additional refactoring.
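To illustrate the locking idea only, here is a minimal sketch under stated assumptions. The `client` type, its `AddProcess` signature, `handleExit`, and the `execConfig` string are hypothetical stand-ins, not the real libcontainerd or daemon code. It shows the exec config being fetched up front and remembered by a callback, and the callback running while the client lock is held, so an exit notification that arrives early has to wait until stdio is attached.

```go
package main

import (
	"fmt"
	"sync"
)

// client is a toy stand-in for the libcontainerd client (hypothetical type).
type client struct {
	mu     sync.Mutex
	events []string // order in which attach and exit were handled
}

// AddProcess runs the attach callback while still holding the client lock,
// so an exit event cannot be processed before stdio is wired up.
func (c *client) AddProcess(id string, attach func() error) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	// The request that actually starts the process would be issued here.
	if err := attach(); err != nil {
		return err
	}
	c.events = append(c.events, "attach "+id)
	return nil
}

// handleExit models the 'process exited' state change; it needs the same lock.
func (c *client) handleExit(id string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.events = append(c.events, "exit "+id)
}

// runDemo simulates a process that exits before stdio attach has finished.
func runDemo() []string {
	c := &client{}
	// Stand-in for calling getExecConfig() before AddProcess().
	execConfig := "exec-config-for-abc"

	var attachedWith string
	var wg sync.WaitGroup
	wg.Add(1)
	err := c.AddProcess("abc", func() error {
		// The process "exits" right away; the handler blocks on c.mu
		// until AddProcess returns.
		go func() {
			defer wg.Done()
			c.handleExit("abc")
		}()
		// The closure remembers the config, so no container lock is
		// taken here to look it up again.
		attachedWith = execConfig
		return nil
	})
	if err != nil {
		panic(err)
	}
	wg.Wait()
	_ = attachedWith
	return append([]string(nil), c.events...)
}

func main() {
	fmt.Println(runDemo())
}
```

In this toy model the exit handler always records its event after the attach, because it cannot take the lock until `AddProcess` returns. The real change involves more state than this, so treat it as a picture of the ordering rather than of the patch itself.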

Signed-off-by: Nalin Dahyabhai <[email protected]>

runcom · 2016-10-28T20:22:11Z

@ncdc could you test this out?

ncdc · 2016-10-28T20:23:05Z

Can we get a brew/koji build?

nalind · 2016-10-28T20:23:56Z

@ncdc for which releases?

ncdc · 2016-10-28T20:24:38Z

@nalind I have Fedora 24, RHEL 7.2, and RHEL 7.3 VMs. You pick? 😄

nalind · 2016-10-28T20:48:25Z

Try http://koji.fedoraproject.org/koji/taskinfo?taskID=16242921 for f24 scratch builds. Thanks!

jwhonce · 2016-10-28T21:05:10Z

retest this please

jwhonce · 2016-10-28T21:19:53Z

@rh-atomic-bot retest this please

imcleod · 2016-10-28T21:20:57Z

#209 - a completely different PR 3 days ago produces the same fatal error on testing (I believe)

runcom · 2016-10-28T21:21:12Z

@jwhonce tests are busted in RH CI, I usually run them locally before merging PRs here

jwhonce · 2016-10-28T21:23:27Z

@runcom So we have this busted-ness and the containers/image busted-ness. This is being tested elsewhere. Do you know if there is any magic to get the tests to re-run other than adding a new commit?

runcom · 2016-10-28T21:53:49Z

@jwhonce containers/image isn't busted anymore (your PR over there just requires a rebase now and it will be good to go).

As for the tests here, they are busted for various reasons, such as test virtual machines running out of memory, cross compilation broken by some of our patches, and whatnot. I would suggest having a Trello card to track a fix to our testing for projectatomic/docker and, in the meantime, just running tests locally as I do every time.

ncdc · 2016-10-31T13:27:44Z

FYI I wasn't able to begin testing this until just now. I'll keep the infinite loop running as long as it will go and report if/when I see it break.

ncdc · 2016-10-31T17:17:21Z

So far so good. Still running. No flakes.

ncdc · 2016-10-31T17:18:01Z

(F24)

docker-latest-1.12.2-2.git8f1975c.fc24.0.bz1389474.1.x86_64
docker-common-1.12.2-5.git8f1975c.fc25.x86_64
container-selinux-1.12.2-5.git8f1975c.fc25.x86_64

ncdc · 2016-10-31T20:50:31Z

I'm not sure exactly how long it took, but my reproducer was able to reproduce the issue using the F24 RPMs listed above.

imcleod · 2016-10-31T21:50:36Z

Taking a closer look at the F24 candidate builds here:

http://koji.fedoraproject.org/koji/taskinfo?taskID=16242921

I'm afraid it looks like they do not actually contain this fix. This commit is included in the SRPM as a patch but is not applied during the %prep process. I'm re-running a build with these patches included and applied on top of docker-latest-1.12.3 and will update with a link if it is successful.

imcleod · 2016-10-31T22:02:46Z

The x86_64 portion of the build has now finished successfully here:

http://koji.fedoraproject.org/koji/taskinfo?taskID=16262294

You can confirm that the attach patch has been applied in the build log:

https://kojipkgs.fedoraproject.org//work/tasks/2294/16262294/build.log

Apologies @ncdc but can you restart testing with this build?

nalind · 2016-11-01T01:49:50Z

My apologies. I mistakenly assumed that the Fedora .spec used %autosetup, like the EL .spec does, and didn't double-check it.

imcleod · 2016-11-01T01:55:12Z

@nalind - No worries

imcleod · 2016-11-01T12:30:02Z

@ncdc - Is your reproducer something sufficiently well documented and/or scripted that some others of us could run it locally?

ncdc · 2016-11-01T12:30:56Z

Yes. See the docker github issue I opened.

imcleod · 2016-11-01T12:38:59Z

Ahh yes. This:

moby#27289

@ncdc - Many thanks.

@runcom, @nalind, @jwhonce - See reproducer script in the issue linked above.

ncdc · 2016-11-01T13:52:40Z

I just kicked off a new test using the new F24 RPM.

runcom · 2016-11-08T10:36:34Z

Any update?

ncdc · 2016-11-08T11:37:43Z

I wasn't able to get it to flake

runcom · 2016-11-08T11:39:21Z

Great!!! ping @rhatdan @imcleod @jwhonce

ddarrah · 2016-11-15T12:38:03Z

It failed for me last night on RHEL 7.3.0 with docker 1.12.3-4. Here is my setup and load:

running in container
running in container
running in container
running in container
running in container
rpc error: code = 2 desc = containerd: container not started
OMG
[root@rhel-730 ~]# 
[root@rhel-730 ~]# docker ps
CONTAINER ID        IMAGE                                     COMMAND                  CREATED             STATUS              PORTS                            NAMES
ea8413ec4d93        httpd                                     "httpd-foreground"       14 hours ago        Up 14 hours         80/tcp                           loving_agnesi
165d5a7aba11        nginx                                     "nginx -g 'daemon off"   16 hours ago        Up 16 hours         80/tcp, 443/tcp                  nginx
04f9249f23e9        cockpit/kubernetes:latest                 "/usr/libexec/cockpit"   16 hours ago        Up 16 hours         0.0.0.0:9090->9090/tcp           atomic-registry-console
5d3bd0447229        openshift/origin:latest                   "/usr/bin/openshift s"   16 hours ago        Up 16 hours         53/tcp, 0.0.0.0:8443->8443/tcp   atomic-registry-master
b1f6b13574e2        openshift/origin-docker-registry:latest   "/bin/sh -c 'DOCKER_R"   16 hours ago        Up 16 hours                                          atomic-registry
[root@rhel-730 ~]# docker version
Client:
 Version:         1.12.3
 API version:     1.24
 Package version: docker-common-1.12.3-4.el7.x86_64
 Go version:      go1.6.2
 Git commit:      f320458-redhat
 Built:           Mon Nov  7 10:15:24 2016
 OS/Arch:         linux/amd64

Server:
 Version:         1.12.3
 API version:     1.24
 Package version: docker-common-1.12.3-4.el7.x86_64
 Go version:      go1.6.2
 Git commit:      f320458-redhat
 Built:           Mon Nov  7 10:15:24 2016
 OS/Arch:         linux/amd64
[root@rhel-730 ~]#

ncdc · 2016-11-15T13:53:18Z

I commented on the bz. This is a different failure condition.

nalind force-pushed the docker-1.12.3-attach branch from 1b53c8b to 63187af Compare October 31, 2016 13:47

rhatdan merged commit cffb114 into projectatomic:docker-1.12.3 Nov 8, 2016
