Make the progresscontainer independent of each other #1306
Conversation
Force-pushed 2ce518d to b064027
took a first pass and i have a few questions.
- since this is on a critical code path, how do we plan on ensuring it doesn't introduce any regressions? are we doing anything else outside the scope of existing functional tests? i'm mainly wondering what kind of additional validation we should be doing for changes like this.
- what happens if the agent restarts? do we not care about the state of the container's ContainerTransitioningStatus when we synchronize state? mtasks will just continue to progress containers based on KnownStatus/DesiredStatus?
agent/api/containerstatus.go
Outdated
@@ -76,6 +106,14 @@ func (healthStatus ContainerHealthStatus) BackendStatus() string {
	}
}

// Done returns whether the transitioing is done base don the container known status
spelling: "....based on the container..."
agent/api/container.go
Outdated
// Check if the container transition has already finished
if containerTransitioningStatusMapping[c.transitioningStatus] <= knownStatus {
	c.transitioningStatus = ContainerNotTransition
do we need a lock for this? also should we just be using the setter here?
This method should be called within SetKnownStatus, which already holds the lock. The setter acquires a lock, which we don't need here; that's why the name is xxxUnsafe.
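For illustration, a minimal sketch of the locking convention being described, using stand-in types (the real agent's Container has many more fields; names follow the diffs in this review):

```go
package api

import "sync"

// Minimal stand-ins for the real types.
type ContainerStatus int32

type Container struct {
	lock              sync.Mutex
	KnownStatusUnsafe ContainerStatus
}

// SetKnownStatus takes the lock exactly once; the helper below must only be
// called while that lock is held, which is what the "Unsafe" suffix signals.
func (c *Container) SetKnownStatus(status ContainerStatus) {
	c.lock.Lock()
	defer c.lock.Unlock()
	c.KnownStatusUnsafe = status
	c.updateTransitioningStatusUnsafe(status)
}

// updateTransitioningStatusUnsafe does no locking of its own: callers such
// as SetKnownStatus already hold c.lock.
func (c *Container) updateTransitioningStatusUnsafe(knownStatus ContainerStatus) {
	// reconcile the in-flight transition state here
}
```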
@adnxn For question 1: functional tests only test the happy path; I will need to test the failure cases where a container fails at different stages:
Let me know if you can think of other test scenarios that were missed here. For 2, we don't need to save the state, as the agent already has the logic to handle duplicate create/start; for pull/stop, the agent will have to pull/stop again on agent restart.
Force-pushed b064027 to 5b3bf11
agent/api/container.go
Outdated
@@ -548,3 +553,38 @@ func (c *Container) GetHealthStatus() HealthStatus {

	return copyHealth
}

// UpdateTransitioningStatusUnsafe updates the container transitioning status
func (c *Container) UpdateTransitioningStatusUnsafe(knownStatus ContainerStatus) {
This seems like it should be scoped to the package: updateTransitioningStatusUnsafe
agent/engine/task_manager.go
Outdated
@@ -619,7 +619,7 @@ func (mtask *managedTask) progressContainers() {
	// complete, but keep reading events as we do. in fact, we have to for
	// transitions to complete
	mtask.waitForContainerTransitions(transitions, transitionChange, transitionChangeContainer)
-	seelog.Debugf("Managed task [%s]: done transitioning all containers", mtask.Arn)
+	seelog.Debugf("Managed task [%s]: done transitioning container", mtask.Arn)
Can you also log the name of the container here?
agent/api/container.go
Outdated
@@ -166,6 +166,9 @@ type Container struct {
	// handled properly so that the state storage continues to work.
	SentStatusUnsafe ContainerStatus `json:"SentStatus"`

	// transitioningStatus is the status of the container
	transitioningStatus ContainerTransitioningStatus
how is this different from AppliedStatus? Also, can you add more details in the documentation about what this field is supposed to do? Why do we need it? How will it be used, etc.?
Another question is: do we need to persist it? If not, why not?
Thanks for pointing this out, I didn't know there was already AppliedStatus for the same purpose. After some investigation, I think we can just use AppliedStatus, and the transitioning-status-related code should be removed. I will update the PR.
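For reference, a sketch of what the reused field looks like conceptually (the field name is real; the doc comment text is an assumption based on this thread, not the agent's actual comment):

```go
// Sketch of the relevant api.Container excerpt.
type ContainerStatus int32

type Container struct {
	// AppliedStatus is the status that the engine has started a transition
	// toward. It is non-none while a pull/create/start/stop is in flight and
	// is reset once the resulting known-status change arrives, so the engine
	// never dispatches the same action twice for one container. Being an
	// exported field, it is also serialized into the agent state file.
	AppliedStatus ContainerStatus
}
```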
agent/api/container.go
Outdated
}

// GetTransitioning returns the transitioning status of container
func (c *Container) GetTransitioning() ContainerTransitioningStatus {
can you please rename this to TransitionStatus()?
agent/api/containerstatus.go
Outdated
@@ -49,9 +49,28 @@ const (
	ContainerUnhealthy
)

const (
	// ContainerNotTransition means the container isn't in the middle of a transition
	ContainerNotTransition ContainerTransitioningStatus = iota
Can this be ContainerTransitionNone?
agent/api/containerstatus.go
Outdated
	// ContainerNotTransition means the container isn't in the middle of a transition
	ContainerNotTransition ContainerTransitioningStatus = iota
	// ContainerPulling means the container is in the process of pulling
	ContainerPulling
Can you make sure that ContainerTransition is a prefix for all of these? Example: ContainerTransitionPulling
agent/api/containerstatus.go
Outdated
// ContainerStatus is an enumeration of valid states in the container lifecycle
type ContainerStatus int32

// ContainerTransitioningStatus is an enumeration of valid container processing
// status which indicates which status the container is being processed
type ContainerTransitioningStatus int32
Can you move all of Transition statuses and methods to a separate file?
agent/api/container.go
Outdated
// SetTransitioningStatus sets the transitioning status of container and returns whether
// the container is already in a transition
func (c *Container) SetTransitioningStatus(status ContainerTransitioningStatus) bool {
UpdateTransitioningTo might be a better name here.
agent/engine/task_manager.go
Outdated
		Status: transition.nextState,
	},
})
go func(cont *api.Container, status api.ContainerStatus) {
why was this logic changed? Can you please add documentation/comments for that as well?
agent/engine/task_manager.go
Outdated
		break
	}
}
// Wait for one transition/acs/docker message
Same comment here. Why was this changed? What's the impact? Can we please add documentation explaining the change?
Force-pushed 5b3bf11 to 9acbbbe
Did we manually test this when two or more containers are present in the task?
agent/api/container.go
Outdated
@@ -241,12 +243,14 @@ func (c *Container) GetKnownStatus() ContainerStatus {
	return c.KnownStatusUnsafe
}

-// SetKnownStatus sets the known status of the container
+// SetKnownStatus sets the known status of the container and update the container
+// transitioning status
nit: container applied status
}

// Check if the container transition has already finished
if c.AppliedStatus <= knownStatus {
Would the condition c.AppliedStatus > knownStatus ever be true here? We set the knownStatus only when a transition is complete, right?
If the agent received duplicate docker events for some reason, this could happen, as an event can carry a past status of the container.
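A short sketch of the guard in question, per the diff and the explanation above (assuming ContainerStatusNone is the zero value, as for the other status enums):

```go
// updateAppliedStatusUnsafe clears AppliedStatus once the known status has
// caught up with the transition the engine dispatched. Caller holds c.lock.
func (c *Container) updateAppliedStatusUnsafe(knownStatus ContainerStatus) {
	if c.AppliedStatus == ContainerStatusNone {
		// No transition in flight; nothing to reconcile.
		return
	}

	// A duplicate/stale docker event can carry a *past* status, in which
	// case c.AppliedStatus > knownStatus and the transition stays open.
	if c.AppliedStatus <= knownStatus {
		c.AppliedStatus = ContainerStatusNone
	}
}
```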
agent/engine/task_manager.go
Outdated
@@ -763,39 +776,23 @@ func (mtask *managedTask) onContainersUnableToTransitionState() {
		mtask.emitTaskEvent(mtask.Task, taskUnableToTransitionToStoppedReason)
		// TODO we should probably panic here
	} else {
-		seelog.Criticalf("Managed task [%s]: voving task to stopped due to bad state", mtask.Arn)
+		seelog.Criticalf("Managed task [%s]: moving task to stopped due to bad state", mtask.Arn)
		mtask.handleDesiredStatusChange(api.TaskStopped, 0)
	}
}

func (mtask *managedTask) waitForContainerTransitions(transitions map[string]api.ContainerStatus,
Can we rename this method, since it does not wait for all the container transitions?
But it's still waiting for a container transition. Let me know if you have a better suggestion.
oh, it just makes it look like it is waiting for ALL the containers' transitions, while it is waiting for just one to finish. maybe just call it waitForContainerTransition?
agent/engine/task_manager.go
Outdated
// a goroutine so that it won't block here before the waitForContainerTransitions
// was called after this function.
go func(cont *api.Container, status api.ContainerStatus) {
	mtask.dockerMessages <- dockerContainerChange{
For my understanding - how do we still handle the container status changes here that handleContainerChange does when no action is required for the transition?
All the events sent to mtask.dockerMessages will be handled by handleContainerChange, which works the same as before except that we don't call handleContainerChange directly now. Is that what you are asking?
> All the events sent to mtask.dockerMessages will be handled by handleContainerChange, which works the same as before except that we don't call handleContainerChange directly now.

It'd be awesome if you added that as a code comment
@richardpen Got it, thanks
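As a sketch, the requested code comment might read something like this (the surrounding names come from the diff above; treat the exact struct layout as an assumption):

```go
// No engine action is required for this transition, so instead of calling
// handleContainerChange directly we emit a synthetic docker event. The
// task's event loop drains mtask.dockerMessages and routes the message
// through handleContainerChange exactly as before; sending from a goroutine
// avoids blocking here before waitForContainerTransition starts reading.
go func(cont *api.Container, status api.ContainerStatus) {
	mtask.dockerMessages <- dockerContainerChange{
		container: cont,
		event: DockerContainerChangeEvent{
			Status: status,
		},
	}
}(container, transition.nextState)
```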
Force-pushed 9acbbbe to 10053da
This requires a CHANGELOG entry as well. Also, should this be tagged for the 1.17.3 milestone?
agent/statemanager/state_manager.go
Outdated
@@ -60,7 +60,8 @@ const (
	// 9) Add 'ipToTask' map to state file
	// 10) Add 'healthCheckType' field in 'api.Container'
	// 11) Add 'PrivateDNSName' field to 'api.ENI'
-	ECSDataVersion = 11
+	// 12) Remove `AppliedStatus` field from 'api.Container'
+	ECSDataVersion = 12
This should still be 11. 10 is what's released out there.
Force-pushed 10053da to a9fc9ad
agent/engine/task_manager.go
Outdated
// There could be multiple transitions, but we just need to wait for one of them
// to ensure that there is at least one container can be processed in the next
// progressContainers. This is done by waiting for one transition/acs/docker message.
if mtask.waitEvent(transitionChange) {
If we do not wait for a container transition here, the next call of progressContainers would happen immediately, which may be a no-op. Is that the reason why we are waiting here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, otherwise it would exhaust CPU with an empty for loop.
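A hedged sketch of that single-transition wait (helper and channel names follow the diffs; not the literal implementation):

```go
// waitForContainerTransition blocks until one transition finishes, or until
// an acs/docker message interrupts the wait, then returns so the caller's
// loop can immediately re-run progressContainers for whichever container
// finished. Returning without waiting would turn the caller into a busy loop.
func (mtask *managedTask) waitForContainerTransition(transitions map[string]api.ContainerStatus,
	transitionChange chan struct{},
	transitionChangeContainer chan string) {
	// There could be multiple transitions in flight, but waiting for the
	// first one to complete is enough to make progress on the next pass.
	if mtask.waitEvent(transitionChange) {
		transitionedContainer := <-transitionChangeContainer
		seelog.Debugf("Managed task [%s]: transition for container [%s] finished",
			mtask.Arn, transitionedContainer)
	}
}
```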
Force-pushed 144c3c2 to 51bcc58
Previously, waitForContainerTransitions would wait for all the transitions to be done before any container could move to its next state. This PR changes it to wait for at most one transition, so that the first container to finish its transition doesn't need to wait for the other containers.
Force-pushed 51bcc58 to 2f169d1
thanks for answering my questions above. i've got one more minor comment. otherwise code lgtm.
also lol that we were writing AppliedStatus to the statefile this whole time too. nice that it's seeing some use now 😁
defer c.lock.Unlock()

if c.AppliedStatus != ContainerStatusNone {
	// return false to indicate the set operation failed
it's not clear to me why returning false here will indicate the set operation failed. do you mean the SetKnownStatus and then updateAppliedStatusUnsafe path has failed?
No, failed means the container is already in a transition (i.e., the applied status isn't none). This ensures that the agent won't call the same API (pull/create/start/stop) twice for the same container.
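A minimal sketch of that check-and-set, per the diff and the explanation above:

```go
// SetAppliedStatus records that the engine is dispatching a transition for
// this container. It returns false when another transition is already in
// flight, which is what keeps the engine from issuing the same
// pull/create/start/stop twice for one container.
func (c *Container) SetAppliedStatus(status ContainerStatus) bool {
	c.lock.Lock()
	defer c.lock.Unlock()

	if c.AppliedStatus != ContainerStatusNone {
		// Already mid-transition: report that the set operation "failed"
		// so the caller skips dispatching a duplicate action.
		return false
	}

	c.AppliedStatus = status
	return true
}
```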
		allWaitingOnPulled = false
	}
}
if allWaitingOnPulled {
How is this condition handled in the new change?
Previously, a pull would block all the transitions, which is why we wanted to break the wait when allWaitingOnPulled was true. The new change makes the wait non-blocking, so this case is handled by default.
I have some minor comments. Also, can you make sure that we have a unit test that's testing the behavior where the order of events is being inspected? For example, a task with two containers, where one container (c1) takes a long time to pull or get created vs another one that doesn't (c2) and ensuring that c2 transitions to RUNNING before c1?
agent/engine/task_manager.go
Outdated
// to ensure that there is at least one container can be processed in the next
// progressContainers. This is done by waiting for one transition/acs/docker message.
if mtask.waitEvent(transitionChange) {
	changedContainer := <-transitionChangeContainer
minor: transitionedContainer is a better name for this
agent/engine/task_manager.go
Outdated
// There could be multiple transitions, but we just need to wait for one of them
// to ensure that there is at least one container can be processed in the next
// progressContainers. This is done by waiting for one transition/acs/docker message.
if mtask.waitEvent(transitionChange) {
minor: transition is a better name for this
agent/engine/task_manager.go
Outdated
-	mtask.waitForContainerTransitions(transitions, transitionChange, transitionChangeContainer)
-	seelog.Debugf("Managed task [%s]: done transitioning all containers", mtask.Arn)
+	mtask.waitForContainerTransition(transitions, transitionChange, transitionChangeContainer)
+	seelog.Debugf("Managed task [%s]: wait for container transition done", mtask.Arn)
I don't think we need this line. You're logging in waitForContainerTransition() anyway.
agent/engine/task_manager.go
Outdated
// There could be multiple transitions, but we just need to wait for one of them
// to ensure that there is at least one container can be processed in the next
// progressContainers. This is done by waiting for one transition/acs/docker message.
if mtask.waitEvent(transitionChange) {
This also needs an else part, where we can log that we were interrupted or that the transition did not finish, or something like that.
Summary
Change waitForContainerTransitions to wait for only one transition instead of all of them, so that each container can move forward on its own without waiting for the other containers' transitions.
Implementation details
Remove the logic of waiting for all transitions in waitForContainerTransitions, and track a per-container applied status to indicate that the container is in the middle of a transition, avoiding duplicate actions on the same container.
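In rough pseudocode, the shape of the change to the progress loop (a sketch, not the literal diff; helper names are assumptions based on the review threads above):

```go
// Inside the managed task's main loop:
for !mtask.steadyState() {
	// Dispatch every transition whose container is not already mid-flight;
	// SetAppliedStatus returning false filters out duplicates.
	transitions := mtask.startContainerTransitions()

	// Before: wait here until ALL entries in transitions completed, so one
	// slow image pull stalled every other container in the task.
	// After: wait for at most one transition (or an acs/docker message),
	// then loop again so any finished container progresses immediately.
	mtask.waitForContainerTransition(transitions, transitionChange, transitionChangeContainer)
}
```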
Testing
- Builds on Linux (make release)
- Builds on Windows (go build -out amazon-ecs-agent.exe ./agent)
- Unit tests on Linux (make test): pass
- Unit tests on Windows (go test -timeout=25s ./agent/...): pass
- Integration tests on Linux (make run-integ-tests): pass
- Integration tests on Windows (.\scripts\run-integ-tests.ps1): pass
- Functional tests on Linux (make run-functional-tests): pass
- Functional tests on Windows (.\scripts\run-functional-tests.ps1): pass
- New tests cover the changes:
Description for the changelog
Licensing
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.