-
Notifications
You must be signed in to change notification settings - Fork 618
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add timeout for setting up pause container #1358
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how did we come up with a timeout value of of 1 minute?
agent/ecscni/plugin.go
Outdated
|
||
select { | ||
case <-ctx.Done(): | ||
return nil, errors.Errorf("cni setup: setup the container namespace timedout") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this can just be errors.New
?
also, message can be "cni: container namespace setup timed out"
agent/engine/docker_task_engine.go
Outdated
@@ -61,6 +61,7 @@ const ( | |||
labelTaskDefinitionFamily = labelPrefix + "task-definition-family" | |||
labelTaskDefinitionVersion = labelPrefix + "task-definition-version" | |||
labelCluster = labelPrefix + "cluster" | |||
cniSetupTimeoutDuration = 1 * time.Minute |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: this can just be cniSetupTimeout
Based on the test, it seems this process is really fast on different instance <5s. But in the cni plugin itself, there are some retry and timeout(maximum is 30s) for retrieving instance metadata and ipv6 gateway, so I doubled it to 1mins. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changes lgtm, just minor edit to error msg wording as @sharanyad called out.
agent/ecscni/plugin.go
Outdated
@@ -69,7 +70,24 @@ func NewClient(cfg *Config) CNIClient { | |||
|
|||
// SetupNS will set up the namespace of container, including create the bridge | |||
// and the veth pair, move the eni to container namespace, setup the routes | |||
func (client *cniClient) SetupNS(cfg *Config) (*current.Result, error) { | |||
func (client *cniClient) SetupNS(cfg *Config, ctx context.Context) (*current.Result, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please make sure that ctx
is the first argument here.
agent/ecscni/plugin.go
Outdated
@@ -38,7 +39,7 @@ type CNIClient interface { | |||
// Capabilities returns the capabilities supported by a plugin | |||
Capabilities(string) ([]string, error) | |||
// SetupNS sets up the namespace of container | |||
SetupNS(*Config) (*current.Result, error) | |||
SetupNS(*Config, context.Context) (*current.Result, error) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you make sure that CleanupNS()
also has a context wired?
agent/ecscni/plugin.go
Outdated
var err error | ||
setupDone := make(chan struct{}) | ||
go func() { | ||
result, err = client.setupNS(cfg) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if this actually times out to the plugin execution path? Will they still be invoked in an orphan process? Can you please add details about how you tested this? What the behavior is? Essentially, how safe is it to do this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the timeout happens, agent will move the task to stopped but the plugin was still running until it wasn't stuck. Ideally we should stop the process, but we don't know the plugin process id, I have opened the issue in libcni for adding the functionality to stop the plugin process after a timeout. The test I did is having two awsvpc task, the first was timedout during cni setup, and the second task can still progress successfully.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we try to invoke CleanupNS()
in that scenario then (if SetupNS()
times out invoke CleanupNS()
)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The CleanupNS
is already called after the setupNS
timeout: when setupNS timeout, the task desired status was moved to stop, and agent will try to stop pause container at which time the CleanupNS
is called.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please create an issue in the agent repo to track this as well (and link the cni issue to that)? It doesn't seem super optimal to just leave a process hanging if we decide to give up early on it.
agent/ecscni/plugin.go
Outdated
var err error | ||
setupDone := make(chan struct{}) | ||
go func() { | ||
result, err = client.setupNS(cfg) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even though it's not today, it seems like this could get racy
in the future. Instead of this semi-global var pattern, can you instead pass an intermediate struct back in setupDone
channel?
agent/ecscni/plugin.go
Outdated
|
||
select { | ||
case <-ctx.Done(): | ||
return nil, errors.New("cni setup: container namespace setup timed out") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you should also check for ctx.Err() here and wrap that if non nil instead of always assuming that context timed out.
agent/ecscni/plugin.go
Outdated
|
||
select { | ||
case <-ctx.Done(): | ||
return errors.New("cni cleanup: container namespace cleanup timed out") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comment as above regarding ctx.Err()
CHANGELOG.md
Outdated
@@ -1,5 +1,8 @@ | |||
# Changelog | |||
|
|||
## 1.17.4-dev | |||
* Bug - Fixed a bug where task stuck in pending due to the cni setup process hangs [#1358](https://github.com/aws/amazon-ecs-agent/pull/1358) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about Fixed a bug where tasks could get stuck waiting for execution of CNI scripts
?
agent/ecscni/plugin.go
Outdated
|
||
select { | ||
case <-ctx.Done(): | ||
return nil, errors.Wrap(ctx.Err(), "cni setup: container namespace setup failed") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you have to check if ctx.Err()
is non nil. A much better model would be to create the derived context within this method and handle ctx.Err() == nil
case as well.
You essentially do
derivedCtx, cancel := context.WithTimeout(ctx, timeout)
defer cancel()
go func(...) {
}
select {
case <-ctx.Done():
if ctc.Err() != nil {
return nil, errors.Wrap()
} else {
return nil, errors.New()
}
...
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on the document here, I don't think check ctx.Err()!=nil
is required.
// If Done is not yet closed, Err returns nil.
// If Done is closed, Err returns a non-nil error explaining why:
// Canceled if the context was canceled
// or DeadlineExceeded if the context's deadline passed.
// After Err returns a non-nil error, successive calls to Err return the same error.
A much better model would be to create the derived context within this method
Yes, a timeout instead should be passed in instead of passing the context, will modify that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on the document here, I don't think check
ctx.Err()!=nil
is required.
Thanks! make sense.
agent/ecscni/plugin.go
Outdated
@@ -110,7 +133,21 @@ func (client *cniClient) SetupNS(cfg *Config) (*current.Result, error) { | |||
|
|||
// CleanupNS will clean up the container namespace, including remove the veth | |||
// pair and stop the dhclient | |||
func (client *cniClient) CleanupNS(cfg *Config) error { | |||
func (client *cniClient) CleanupNS(ctx context.Context, cfg *Config) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment about accepting timeout as a param and dealing with ctx.Err()
as above.
Summary
Added timeout for SetupNS.
Implementation details
Testing
make release
)go build -out amazon-ecs-agent.exe ./agent
)make test
) passgo test -timeout=25s ./agent/...
) passmake run-integ-tests
) pass.\scripts\run-integ-tests.ps1
) passmake run-functional-tests
) pass.\scripts\run-functional-tests.ps1
) passNew tests cover the changes:
Description for the changelog
Licensing
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.