-
Notifications
You must be signed in to change notification settings - Fork 619
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
engine,dockerclient: per container start/stop timeouts #1849
engine,dockerclient: per container start/stop timeouts #1849
Conversation
will squash to single commit before unWIP |
38b4496
to
057f252
Compare
@@ -550,8 +552,8 @@ func (dg *dockerGoClient) createContainer(ctx context.Context, | |||
return dg.containerMetadata(ctx, dockerContainer.ID) | |||
} | |||
|
|||
func (dg *dockerGoClient) StartContainer(ctx context.Context, id string, timeout time.Duration) DockerContainerMetadata { | |||
ctx, cancel := context.WithTimeout(ctx, timeout) | |||
func (dg *dockerGoClient) StartContainer(ctx context.Context, id string, ctxTimeout time.Duration) DockerContainerMetadata { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add similar comment for ctxTimeout here as well? is there valid range of values for this like it does for stop timeout?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, this is just what we chose for the start container timeout. we have some sane defaults here #1321, but since we're allowing it to be configurable from the task def we don't have a range check at this layer. the service is enforcing a lower bound of 2s and no upper bound.
the docker daemon's start container api doesnt enforce timeouts.
defer cancel() | ||
defer metrics.MetricsEngineGlobal.RecordDockerMetric("STOP_CONTAINER")() | ||
// Buffered channel so in the case of timeout it takes one write, never gets | ||
// read, and can still be GC'd | ||
response := make(chan DockerContainerMetadata, 1) | ||
go func() { response <- dg.stopContainer(ctx, dockerID) }() | ||
go func() { response <- dg.stopContainer(ctx, dockerID, timeout) }() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this be passing ctxTimeout? What is the difference between timeout and ctxTimeout?
For my understanding, why isn't startContainer handled the same with timeout?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the difference between timeout and ctxTimeout?
so i have some information in the comment above this code and also the PR description.
but also - ctxTimeout is for the timeout for the context the agent uses for the actual call to docker, where as timeout is a parameter for the api call to stop container call. the timeout parameter is how long the docker daemon will wait after a docker stop call to sigkill the container. we want our context timeout to be greater than the daemon's sigkill timeout.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❗️ reminder: need to bump state file version for container fields
* start timeout is governed by the agent and is the context timeout for the StartContainer api call * stop timeout is the parameter passed to StopContainer and is time the docker daemon waits after a StopContainer call to issue a SIGKILL to the container
057f252
to
a7b34b2
Compare
cherry-picked changes on to |
@@ -236,6 +236,10 @@ type Container struct { | |||
|
|||
Secrets []*Secret `locationName:"secrets" type:"list"` | |||
|
|||
StartTimeout *int64 `locationName:"startTimeout" type:"integer"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we make sure ECS service treats it as int64 as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we're treating this field as we do with cpu/mem
agent/statemanager/state_manager.go
Outdated
@@ -78,7 +78,10 @@ const ( | |||
// a) Add 'Associations' field to 'api.task.task' | |||
// b) Add 'GPUIDs' field to 'apicontainer.Container' | |||
// c) Add 'NvidiaRuntime' field to 'api.task.task' | |||
// 20) Add 'DependsOn' field to 'apicontainer.Container' | |||
// 20) | |||
// a) Add 'DependsOn' field to 'apicontainer.Container' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: indentation
// 20) Add 'DependsOn' field to 'apicontainer.Container' | ||
// 20) | ||
// a) Add 'DependsOn' field to 'apicontainer.Container' | ||
// b) Add 'StartTime' field to 'api.container.Container' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you please add a test for the state file version bump in the state manager package?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch, will add state manager unit tests for this.
@@ -1447,3 +1447,42 @@ func TestGPUAssociationTask(t *testing.T) { | |||
go taskEngine.AddTask(&taskUpdate) | |||
verifyTaskIsStopped(stateChangeEvents, testTask) | |||
} | |||
|
|||
func TestPerContainerStopTimeout(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there any reason why this is under unix?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
adding TODO for windows test.
timeout := engine.cfg.DockerStopTimeout + dockerclient.StopContainerTimeout | ||
return engine.client.StopContainer(engine.ctx, dockerContainer.DockerID, timeout) | ||
|
||
apiTimeoutStopContainer := container.GetStopTimeout() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about calling this stopContainerAPITimeout
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i dont have a strong feeling either way tbh
a7b34b2
to
3265bf9
Compare
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
code lgtm.
just that ECS_CONTAINER_STOP_TIMEOUT
and stopTimeout
have different meanings and might confuse customers :(
since this is a communication problem, ill update the README entry for |
Summary
start timeout is governed by the agent and is the context timeout for
the StartContainer api call
stop timeout is the parameter passed to StopContainer and is time the
docker daemon waits after a StopContainer call to issue a SIGKILL to the
container
Implementation details
Testing
New tests cover the changes: yes
unit test:
TestPerContainerTimeouts
TestPostUnmarshalPerContainerTimeouts
integ test:
TestPerContainerStopTimeout
Description for the changelog
Licensing
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.