nalind commented Oct 28, 2016

Problem: if a process that's been started for 'docker exec' exits fast enough, the daemon can receive a 'process exited' update from containerd before it starts passing stdio data back and forth, so the process's output is lost.

Why it happens: the 'process exited' state change message is only processed after acquiring a lock for the container. Most of the exec() setup is done while holding that lock, but the lock is freed when libcontainerd calls back to set up the passing of stdio data, which leaves a window in which the exit can be processed first.

Proposed change: call getExecConfig() before AddProcess(), and modify AddProcess() to accept a callback, supplied by the daemon, that remembers the exec.Config value that was already retrieved.

The effect should be comparable to the changes in moby#27467 (in that we now hold the libcontainerd client lock, and make sure that we don't obtain the Container object's lock by calling getExecConfig()), but without additional refactoring.
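
A minimal sketch of the ordering described above, not the actual patch. Every type and signature here (ExecConfig, Client, AttachFunc, Daemon, StartExec, HandleExit) is a hypothetical stand-in rather than the real libcontainerd or daemon API. It only illustrates the idea: look up the exec config first, then hand AddProcess() a closure that already holds it, so the stdio setup runs under the client lock without a second lookup.

```go
package main

import (
	"fmt"
	"sync"
)

// ExecConfig is a hypothetical stand-in for the daemon's exec.Config.
type ExecConfig struct {
	ID     string
	Output []string // records what the stdio setup did, for illustration
}

// AttachFunc is the callback that sets up stdio for a newly added process.
type AttachFunc func(processID string) error

// Client is a hypothetical stand-in for the libcontainerd client.
type Client struct {
	mu sync.Mutex // client lock, shared by AddProcess and HandleExit
}

// AddProcess invokes attach while still holding the client lock, so an
// exit event for the same process cannot be handled before stdio is set up.
func (c *Client) AddProcess(processID string, attach AttachFunc) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	// A real client would ask containerd to start the process here.
	return attach(processID)
}

// HandleExit processes a 'process exited' update under the same lock.
func (c *Client) HandleExit(processID string, onExit func(string)) {
	c.mu.Lock()
	defer c.mu.Unlock()
	onExit(processID)
}

// Daemon is a hypothetical stand-in for the docker daemon.
type Daemon struct {
	mu     sync.Mutex // stands in for the lock taken during config lookup
	execs  map[string]*ExecConfig
	client *Client
}

// getExecConfig looks up the exec config and takes the daemon-side lock.
func (d *Daemon) getExecConfig(id string) (*ExecConfig, error) {
	d.mu.Lock()
	defer d.mu.Unlock()
	ec, ok := d.execs[id]
	if !ok {
		return nil, fmt.Errorf("no such exec: %s", id)
	}
	return ec, nil
}

// StartExec retrieves the config BEFORE AddProcess and passes a closure
// that remembers it, so the callback never calls getExecConfig itself.
func (d *Daemon) StartExec(id string) error {
	ec, err := d.getExecConfig(id)
	if err != nil {
		return err
	}
	attach := func(processID string) error {
		// No lookup and no extra locking here: ec was captured above.
		ec.Output = append(ec.Output, "stdio attached for "+processID)
		return nil
	}
	return d.client.AddProcess(id, attach)
}

func main() {
	d := &Daemon{
		execs:  map[string]*ExecConfig{"exec1": {ID: "exec1"}},
		client: &Client{},
	}
	if err := d.StartExec("exec1"); err != nil {
		fmt.Println("error:", err)
		return
	}
	d.client.HandleExit("exec1", func(id string) {
		fmt.Println("exit processed for", id)
	})
	fmt.Println(d.execs["exec1"].Output)
}
```

In this sketch the exit handler and the stdio setup contend for the same client lock, so the exit can only be processed after the attach callback has returned.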

Signed-off-by: Nalin Dahyabhai <[email protected]>

Collaborator

runcom commented Oct 28, 2016

@ncdc could you test this out?


ncdc commented Oct 28, 2016

Can we get a brew/koji build?

Author

nalind commented Oct 28, 2016

@ncdc for which releases?


ncdc commented Oct 28, 2016

@nalind I have Fedora 24, RHEL 7.2, and RHEL 7.3 VMs. You pick? 😄

Author

nalind commented Oct 28, 2016

Try http://koji.fedoraproject.org/koji/taskinfo?taskID=16242921 for f24 scratch builds. Thanks!

Collaborator

jwhonce commented Oct 28, 2016

retest this please

Collaborator

jwhonce commented Oct 28, 2016

@rh-atomic-bot retest this please


imcleod commented Oct 28, 2016

#209, a completely different PR from 3 days ago, produces the same fatal error in testing (I believe).

Collaborator

runcom commented Oct 28, 2016

@jwhonce tests are busted in RH CI; I usually run them locally before merging PRs here.

Collaborator

jwhonce commented Oct 28, 2016

@runcom So we have this busted-ness and the containers/image busted-ness. This is being tested elsewhere. Do you know if there is any magic to get the tests to re-run other than adding a new commit?

Collaborator

runcom commented Oct 28, 2016

@jwhonce containers/image isn't busted anymore (your PR over there just requires a rebase now and it will be good to go).

As for the tests here, they are busted for various reasons, such as test virtual machines running out of memory, cross-compilation broken by some of our patches, and whatnot. I would suggest having a Trello card to track a fix to our testing for projectatomic/docker and, in the meantime, just running tests locally as I do every time.


ncdc commented Oct 31, 2016

FYI I wasn't able to begin testing this until just now. I'll keep the infinite loop running as long as it will go and report if/when I see it break.

nalind force-pushed the docker-1.12.3-attach branch from 1b53c8b to 63187af on October 31, 2016 13:47

ncdc commented Oct 31, 2016

So far so good. Still running. No flakes.


ncdc commented Oct 31, 2016

(F24)

  • docker-latest-1.12.2-2.git8f1975c.fc24.0.bz1389474.1.x86_64
  • docker-common-1.12.2-5.git8f1975c.fc25.x86_64
  • container-selinux-1.12.2-5.git8f1975c.fc25.x86_64


ncdc commented Oct 31, 2016

I'm not sure exactly how long it took, but my reproducer was able to reproduce the issue using the F24 RPMs listed above.


imcleod commented Oct 31, 2016

Taking a closer look at the F24 candidate builds here:

http://koji.fedoraproject.org/koji/taskinfo?taskID=16242921

I'm afraid it looks like they do not actually contain this fix. This commit is included in the SRPM as a patch but is not applied during the %prep process. I'm re-running a build with these patches included and applied on top of docker-latest-1.12.3 and will update with a link if it is successful.


imcleod commented Oct 31, 2016

The x86_64 portion of the build has now finished successfully here:

http://koji.fedoraproject.org/koji/taskinfo?taskID=16262294

You can confirm that the attach patch has applied in the build log:

https://kojipkgs.fedoraproject.org//work/tasks/2294/16262294/build.log

Apologies @ncdc but can you restart testing with this build?

Author

nalind commented Nov 1, 2016

My apologies. I mistakenly assumed that the Fedora .spec used %autosetup, like the EL .spec does, and didn't double-check it.


imcleod commented Nov 1, 2016

@nalind - No worries


imcleod commented Nov 1, 2016

@ncdc - Is your reproducer something sufficiently well documented and/or scripted that some others of us could run it locally?


ncdc commented Nov 1, 2016

Yes. See the docker github issue I opened.


imcleod commented Nov 1, 2016

Ahh yes. This:

moby#27289

@ncdc - Many thanks.

@runcom, @nalind, @jwhonce - See reproducer script in the issue linked above.


ncdc commented Nov 1, 2016

I just kicked off a new test using the new F24 RPM.

Collaborator

runcom commented Nov 8, 2016

Any update?


ncdc commented Nov 8, 2016

I wasn't able to get it to flake

Collaborator

runcom commented Nov 8, 2016

Great!!! ping @rhatdan @imcleod @jwhonce

rhatdan merged commit cffb114 into projectatomic:docker-1.12.3 on Nov 8, 2016

ddarrah commented Nov 15, 2016

It failed for me last night on RHEL 7.3.0 with docker 1.12.3-4. Here is my setup and load:

running in container
running in container
running in container
running in container
running in container
rpc error: code = 2 desc = containerd: container not started
OMG
[root@rhel-730 ~]# 
[root@rhel-730 ~]# docker ps
CONTAINER ID        IMAGE                                     COMMAND                  CREATED             STATUS              PORTS                            NAMES
ea8413ec4d93        httpd                                     "httpd-foreground"       14 hours ago        Up 14 hours         80/tcp                           loving_agnesi
165d5a7aba11        nginx                                     "nginx -g 'daemon off"   16 hours ago        Up 16 hours         80/tcp, 443/tcp                  nginx
04f9249f23e9        cockpit/kubernetes:latest                 "/usr/libexec/cockpit"   16 hours ago        Up 16 hours         0.0.0.0:9090->9090/tcp           atomic-registry-console
5d3bd0447229        openshift/origin:latest                   "/usr/bin/openshift s"   16 hours ago        Up 16 hours         53/tcp, 0.0.0.0:8443->8443/tcp   atomic-registry-master
b1f6b13574e2        openshift/origin-docker-registry:latest   "/bin/sh -c 'DOCKER_R"   16 hours ago        Up 16 hours                                          atomic-registry
[root@rhel-730 ~]# docker version
Client:
 Version:         1.12.3
 API version:     1.24
 Package version: docker-common-1.12.3-4.el7.x86_64
 Go version:      go1.6.2
 Git commit:      f320458-redhat
 Built:           Mon Nov  7 10:15:24 2016
 OS/Arch:         linux/amd64

Server:
 Version:         1.12.3
 API version:     1.24
 Package version: docker-common-1.12.3-4.el7.x86_64
 Go version:      go1.6.2
 Git commit:      f320458-redhat
 Built:           Mon Nov  7 10:15:24 2016
 OS/Arch:         linux/amd64
[root@rhel-730 ~]#


ncdc commented Nov 15, 2016

I commented on the bz. This is a different failure condition.
