Skip to content
This repository was archived by the owner on Feb 8, 2021. It is now read-only.

notify event when start container failed #370

Closed
wants to merge 1 commit into from

Conversation

Crazykev
Copy link
Contributor

add a event notify event when start container failed, fix #366

Signed-off-by: Crazykev [email protected]

@WeiZhang555
Copy link
Contributor

This can be one solution, but not enough. here:
https://github.com/hyperhq/runv/blob/master/supervisor/hyperpod.go#L415

func (hp *HyperPod) createContainer(container, bundlePath, stdin, stdout, stderr string, spec *specs.Spec) (*Container, error) {
...
    c.run(p)
    return c, nil
}

Indeed c.run(p) failed but createContainer didn't check the error, it return successfully and docker will regard the container as running successfully, but it's not true.

Your patch will let container be running then exited but it's not good enough.

What's worse, the c.start(p) can sometimes hang, and the codes you added won't be reached. To fully solve this problem:

  1. we have to find why the start() function hangs, or the bug will always be there.
  2. we should let c.run(p) return error/nil when container start() failed/succeed, we can't simply ignore the error from run(p), that makes no sense.

@Crazykev
Copy link
Contributor Author

Crazykev commented Oct 21, 2016

@WeiZhang555
Actuall, c.run(p) will(and should) feedback the result through Event mechanism, client will get feedback from here:
https://github.com/hyperhq/runv/blob/master/start.go#L266 , After c.CreateContainer at L252.
The reason why it hangs(in my test) is that if c.start(p) failed with error from here:
https://github.com/hyperhq/runv/blob/master/supervisor/container.go#L35
will reap and return, without feedback event.
So event loop from client(start.go#L266) will never get any event, that is hang forever.

The patch here just solve this problem, if start container failed in c.start(p), make sure client get notified. But there is still something bad, like we can't get the real reason why start failed.

What's worse, the c.start(p) can sometimes hang,

Can you provide more information about this, log or something ?

@WeiZhang555
Copy link
Contributor

WeiZhang555 commented Oct 21, 2016

Can you provide more information about this, log or something ?

This happens during my pressure test, see #366. Debug log is too long, and I can't find useful error message from the log.

I can see you are running your test with runv native command (e.g. runv start), what I was talking about is calling runv from docker daemon, which means docker daemon ==> runv-containerd. You added an exit event, which will notify docker daemon to mark the container as exited, it's good and improved the leaking container problem a little bit. But INDEED the container is not exited, it was never started successfully, so this mechanism is not quite right.

Besides that, c.start() hangs will make this situation worse, we need to find why it hangs.

Sorry again that I can't provide the log, because it's truly toooooo long and I can't find useful info for this bug.

@wrouesnel
Copy link
Contributor

I don't think sending exit before start-container makes sense. From a libcontainerd API consumer perspective, nothing happened because a start failure implies the VM itself failed to bring the container up - it never actually entered it's normal runtime process.

@Crazykev
Copy link
Contributor Author

@wrouesnel Yes, I totally agree with you that we should not

sending exit before start-container

This should be a new state/singal, ie internal failure, to be sent to client to explain why we start container failed instead of hang forever for now.

@WeiZhang555 And also, There are more things to change in runv, so maybe close this or consider again whether this issue still exist in "new" runv. Sorry for the delay.

@laijs
Copy link
Contributor

laijs commented Aug 3, 2017

@Crazykev @wrouesnel @WeiZhang555 runv cli is being improved continuously and had just been refactored at #537 . As a result, it seams that the changes in this pr is outdated.
Thank you for your contribution.

@laijs laijs closed this Aug 3, 2017
@Crazykev Crazykev deleted the fix-hang branch September 26, 2017 04:21
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Bug report: leaking container while doing pressure test
4 participants