Make the progresscontainer independent of each other #1306
Conversation
Force-pushed 2ce518d to b064027
took a first pass and i have a few questions.
- since this is on a critical code path, how do we plan on ensuring it doesn't introduce any regressions? are we doing anything else outside the scope of existing functional tests? i'm mainly wondering what kind of additional validation we should be doing for changes like this.
- what happens if the agent restarts? do we not care about the state of the container's ContainerTransitioningStatus when we synchronize state? mtasks will just continue to progress containers based on KnownStatus/DesiredStatus?
agent/api/containerstatus.go
Outdated
@@ -76,6 +106,14 @@ func (healthStatus ContainerHealthStatus) BackendStatus() string {
	}
}

// Done returns whether the transitioing is done base don the container known status
spelling: "....based on the container..."
agent/api/container.go
Outdated
// Check if the container transition has already finished
if containerTransitioningStatusMapping[c.transitioningStatus] <= knownStatus {
	c.transitioningStatus = ContainerNotTransition
do we need a lock for this? also should we just be using the setter here?
This method should be called within SetKnownStatus, which already holds the lock. The setter acquires a lock, which we don't need here; that's why the name is xxxUnsafe.
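For illustration, a minimal sketch of the locking convention being described, using stand-in types (the real agent's Container has many more fields; names follow the diffs in this review):

```go
package api

import "sync"

// Minimal stand-ins for the real types.
type ContainerStatus int32

type Container struct {
	lock              sync.Mutex
	KnownStatusUnsafe ContainerStatus
}

// SetKnownStatus takes the lock exactly once; the helper below must only be
// called while that lock is held, which is what the "Unsafe" suffix signals.
func (c *Container) SetKnownStatus(status ContainerStatus) {
	c.lock.Lock()
	defer c.lock.Unlock()
	c.KnownStatusUnsafe = status
	c.updateTransitioningStatusUnsafe(status)
}

// updateTransitioningStatusUnsafe does no locking of its own: callers such
// as SetKnownStatus already hold c.lock.
func (c *Container) updateTransitioningStatusUnsafe(knownStatus ContainerStatus) {
	// reconcile the in-flight transition state here
}
```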
@adnxn For question 1: functional tests only test the happy path; I will need to test the failure cases where a container fails at different stages:
Let me know if you can think of other test scenarios that were missed here. For 2, we don't need to save the state, as the agent already has the logic to handle duplicate create/start; for pull/stop, the agent will have to pull/stop again on agent restart.
Force-pushed b064027 to 5b3bf11
agent/api/container.go
Outdated
@@ -548,3 +553,38 @@ func (c *Container) GetHealthStatus() HealthStatus {

	return copyHealth
}

// UpdateTransitioningStatusUnsafe updates the container transitioning status
func (c *Container) UpdateTransitioningStatusUnsafe(knownStatus ContainerStatus) {
This seems like it should be scoped to the package: updateTransitioningStatusUnsafe
agent/engine/task_manager.go
Outdated
@@ -619,7 +619,7 @@ func (mtask *managedTask) progressContainers() {
	// complete, but keep reading events as we do. in fact, we have to for
	// transitions to complete
	mtask.waitForContainerTransitions(transitions, transitionChange, transitionChangeContainer)
-	seelog.Debugf("Managed task [%s]: done transitioning all containers", mtask.Arn)
+	seelog.Debugf("Managed task [%s]: done transitioning container", mtask.Arn)
Can you also log the name of the container here?
agent/api/container.go
Outdated
@@ -166,6 +166,9 @@ type Container struct {
	// handled properly so that the state storage continues to work.
	SentStatusUnsafe ContainerStatus `json:"SentStatus"`

	// transitioningStatus is the status of the container
	transitioningStatus ContainerTransitioningStatus
how is this different from AppliedStatus? Also, can you add more details in the documentation about what this field is supposed to do? Why do we need it? How will it be used, etc.?
Another question is: do we need to persist it? If not, why not?
Thanks for pointing this out, I didn't know there was already AppliedStatus for the same purpose. After some investigation, I think we can just use AppliedStatus, and the transitioning-status-related code should be removed. I will update the PR.
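For reference, a sketch of what the reused field looks like conceptually (the field name is real; the doc comment text is an assumption based on this thread, not the agent's actual comment):

```go
// Sketch of the relevant api.Container excerpt.
type ContainerStatus int32

type Container struct {
	// AppliedStatus is the status that the engine has started a transition
	// toward. It is non-none while a pull/create/start/stop is in flight and
	// is reset once the resulting known-status change arrives, so the engine
	// never dispatches the same action twice for one container. Being an
	// exported field, it is also serialized into the agent state file.
	AppliedStatus ContainerStatus
}
```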
agent/api/container.go
Outdated
}

// GetTransitioning returns the transitioning status of container
func (c *Container) GetTransitioning() ContainerTransitioningStatus {
can you please rename this to TransitionStatus()?
agent/api/containerstatus.go
Outdated
@@ -49,9 +49,28 @@ const (
	ContainerUnhealthy
)

const (
	// ContainerNotTransition means the container isn't in the middle of a transition
	ContainerNotTransition ContainerTransitioningStatus = iota
Can this be ContainerTransitionNone?
agent/api/containerstatus.go
Outdated
	// ContainerNotTransition means the container isn't in the middle of a transition
	ContainerNotTransition ContainerTransitioningStatus = iota
	// ContainerPulling means the container is in the process of pulling
	ContainerPulling
Can you make sure that ContainerTransition is a prefix for all of these? Example: ContainerTransitionPulling
agent/api/containerstatus.go
Outdated
// ContainerStatus is an enumeration of valid states in the container lifecycle
type ContainerStatus int32

// ContainerTransitioningStatus is an enumeration of valid container processing
// status which indicates which status the container is being processed
type ContainerTransitioningStatus int32
Can you move all of Transition statuses and methods to a separate file?
agent/api/container.go
Outdated
// SetTransitioningStatus sets the transitioning status of container and returns whether
// the container is already in a transition
func (c *Container) SetTransitioningStatus(status ContainerTransitioningStatus) bool {
UpdateTransitioningTo might be a better name here.
agent/engine/task_manager.go
Outdated
		Status: transition.nextState,
	},
})
go func(cont *api.Container, status api.ContainerStatus) {
why was this logic changed? Can you please add documentation/comments for that as well?
agent/engine/task_manager.go
Outdated
		break
	}
}
// Wait for one transition/acs/docker message
Same comment here. Why was this changed? What's the impact? Can we please add documentation explaining the change?
Force-pushed 5b3bf11 to 9acbbbe
Did we manually test this when two or more containers are present in the task?
agent/api/container.go
Outdated
@@ -241,12 +243,14 @@ func (c *Container) GetKnownStatus() ContainerStatus {
	return c.KnownStatusUnsafe
}

-// SetKnownStatus sets the known status of the container
+// SetKnownStatus sets the known status of the container and update the container
+// transitioning status
nit: container applied status
}

// Check if the container transition has already finished
if c.AppliedStatus <= knownStatus {
Would the condition c.AppliedStatus > knownStatus ever be true here? We set the knownStatus only when a transition is complete, right?
If the agent received duplicate docker events for some reason, this could happen, as an event can carry a past status of the container.
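A short sketch of the guard in question, per the diff and the explanation above (assuming ContainerStatusNone is the zero value, as for the other status enums):

```go
// updateAppliedStatusUnsafe clears AppliedStatus once the known status has
// caught up with the transition the engine dispatched. Caller holds c.lock.
func (c *Container) updateAppliedStatusUnsafe(knownStatus ContainerStatus) {
	if c.AppliedStatus == ContainerStatusNone {
		// No transition in flight; nothing to reconcile.
		return
	}

	// A duplicate/stale docker event can carry a *past* status, in which
	// case c.AppliedStatus > knownStatus and the transition stays open.
	if c.AppliedStatus <= knownStatus {
		c.AppliedStatus = ContainerStatusNone
	}
}
```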
agent/engine/task_manager.go
Outdated
@@ -763,39 +776,23 @@ func (mtask *managedTask) onContainersUnableToTransitionState() {
		mtask.emitTaskEvent(mtask.Task, taskUnableToTransitionToStoppedReason)
		// TODO we should probably panic here
	} else {
-		seelog.Criticalf("Managed task [%s]: voving task to stopped due to bad state", mtask.Arn)
+		seelog.Criticalf("Managed task [%s]: moving task to stopped due to bad state", mtask.Arn)
		mtask.handleDesiredStatusChange(api.TaskStopped, 0)
	}
}

func (mtask *managedTask) waitForContainerTransitions(transitions map[string]api.ContainerStatus,
Can we rename this method, since it does not wait for all the container transitions?
But it's still waiting for a container transition. Let me know if you have a better suggestion.
oh, it just makes it look like it is waiting for ALL the containers' transitions, while it is waiting for just one to finish. maybe just call it waitForContainerTransition?
agent/engine/task_manager.go
Outdated
// a goroutine so that it won't block here before the waitForContainerTransitions
// was called after this function.
go func(cont *api.Container, status api.ContainerStatus) {
	mtask.dockerMessages <- dockerContainerChange{
For my understanding - how do we still handle the container status changes here that handleContainerChange does when no action is required for the transition?
All the events sent to mtask.dockerMessages will be handled by handleContainerChange, which works the same as before except that we don't call handleContainerChange directly now. Is that what you are asking?
> All the events sent to mtask.dockerMessages will be handled by handleContainerChange, which works the same as before except that we don't call handleContainerChange directly now.

It'd be awesome if you added that as a code comment
@richardpen Got it, thanks
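As a sketch, the requested code comment might read something like this (the surrounding names come from the diff above; treat the exact struct layout as an assumption):

```go
// No engine action is required for this transition, so instead of calling
// handleContainerChange directly we emit a synthetic docker event. The
// task's event loop drains mtask.dockerMessages and routes the message
// through handleContainerChange exactly as before; sending from a goroutine
// avoids blocking here before waitForContainerTransition starts reading.
go func(cont *api.Container, status api.ContainerStatus) {
	mtask.dockerMessages <- dockerContainerChange{
		container: cont,
		event: DockerContainerChangeEvent{
			Status: status,
		},
	}
}(container, transition.nextState)
```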
Force-pushed 9acbbbe to 10053da
This requires a CHANGELOG entry as well. Also, should this be tagged for the 1.17.3 milestone?
agent/statemanager/state_manager.go
Outdated
@@ -60,7 +60,8 @@ const (
	// 9) Add 'ipToTask' map to state file
	// 10) Add 'healthCheckType' field in 'api.Container'
	// 11) Add 'PrivateDNSName' field to 'api.ENI'
-	ECSDataVersion = 11
+	// 12) Remove `AppliedStatus` field from 'api.Container'
+	ECSDataVersion = 12
This should still be 11. 10 is what's released out there.
Force-pushed 10053da to a9fc9ad
agent/engine/task_manager.go
Outdated
// There could be multiple transitions, but we just need to wait for one of them
// to ensure that there is at least one container can be processed in the next
// progressContainers. This is done by waiting for one transition/acs/docker message.
if mtask.waitEvent(transitionChange) {
If we do not wait for a container transition here, the next call of progressContainers would happen immediately, which may be a no-op. Is that the reason why we are waiting here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, otherwise it would exhaust CPU with an empty for loop.
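A hedged sketch of that single-transition wait (helper and channel names follow the diffs; not the literal implementation):

```go
// waitForContainerTransition blocks until one transition finishes, or until
// an acs/docker message interrupts the wait, then returns so the caller's
// loop can immediately re-run progressContainers for whichever container
// finished. Returning without waiting would turn the caller into a busy loop.
func (mtask *managedTask) waitForContainerTransition(transitions map[string]api.ContainerStatus,
	transitionChange chan struct{},
	transitionChangeContainer chan string) {
	// There could be multiple transitions in flight, but waiting for the
	// first one to complete is enough to make progress on the next pass.
	if mtask.waitEvent(transitionChange) {
		transitionedContainer := <-transitionChangeContainer
		seelog.Debugf("Managed task [%s]: transition for container [%s] finished",
			mtask.Arn, transitionedContainer)
	}
}
```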
Force-pushed 144c3c2 to 51bcc58
Previously, waitForContainerTransitions would wait for all the transitions to be done before any container could move to its next state. This PR changes it to wait for at most one transition, so that the first container to finish its transition doesn't need to wait for the other containers.
Force-pushed 51bcc58 to 2f169d1
thanks for answering my questions above. i've got one more minor comment. otherwise code lgtm.
also lol that we were writing AppliedStatus to the statefile this whole time too. nice that it's seeing some use now 😁
defer c.lock.Unlock()

if c.AppliedStatus != ContainerStatusNone {
	// return false to indicate the set operation failed
it's not clear to me why returning false here will indicate the set operation failed. do you mean the SetKnownStatus and then updateAppliedStatusUnsafe path has failed?
No, failed means the container is already in a transition (i.e., the applied status isn't none). This ensures that the agent won't call the same API (pull/create/start/stop) twice for the same container.
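A minimal sketch of that check-and-set, per the diff and the explanation above:

```go
// SetAppliedStatus records that the engine is dispatching a transition for
// this container. It returns false when another transition is already in
// flight, which is what keeps the engine from issuing the same
// pull/create/start/stop twice for one container.
func (c *Container) SetAppliedStatus(status ContainerStatus) bool {
	c.lock.Lock()
	defer c.lock.Unlock()

	if c.AppliedStatus != ContainerStatusNone {
		// Already mid-transition: report that the set operation "failed"
		// so the caller skips dispatching a duplicate action.
		return false
	}

	c.AppliedStatus = status
	return true
}
```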
		allWaitingOnPulled = false
	}
}
if allWaitingOnPulled {
How is this condition handled in the new change?
Previously, a pull would block all the transitions, which is why we wanted to break the wait when allWaitingOnPulled was true. The new change makes the wait non-blocking, so this case is handled by default.
I have some minor comments. Also, can you make sure that we have a unit test that's testing the behavior where the order of events is being inspected? For example, a task with two containers, where one container (c1) takes a long time to pull or get created vs another one that doesn't (c2) and ensuring that c2 transitions to RUNNING before c1?
agent/engine/task_manager.go
Outdated
// to ensure that there is at least one container can be processed in the next
// progressContainers. This is done by waiting for one transition/acs/docker message.
if mtask.waitEvent(transitionChange) {
	changedContainer := <-transitionChangeContainer
minor: transitionedContainer is a better name for this
agent/engine/task_manager.go
Outdated
// There could be multiple transitions, but we just need to wait for one of them
// to ensure that there is at least one container can be processed in the next
// progressContainers. This is done by waiting for one transition/acs/docker message.
if mtask.waitEvent(transitionChange) {
minor: transition is a better name for this
agent/engine/task_manager.go
Outdated
-	mtask.waitForContainerTransitions(transitions, transitionChange, transitionChangeContainer)
-	seelog.Debugf("Managed task [%s]: done transitioning all containers", mtask.Arn)
+	mtask.waitForContainerTransition(transitions, transitionChange, transitionChangeContainer)
+	seelog.Debugf("Managed task [%s]: wait for container transition done", mtask.Arn)
I don't think we need this line. You're logging in waitForContainerTransition() anyway.
agent/engine/task_manager.go
Outdated
// There could be multiple transitions, but we just need to wait for one of them
// to ensure that there is at least one container can be processed in the next
// progressContainers. This is done by waiting for one transition/acs/docker message.
if mtask.waitEvent(transitionChange) {
This also needs an else part, where we can log that we were interrupted or that the transition did not finish, or something like that.
Summary
Change waitForContainerTransitions to wait for only one transition instead of all of them, so that each container can move forward on its own without waiting for the other containers' transitions.
Implementation details
Remove the logic of waiting for all transitions in waitForContainerTransitions, and track a per-container applied status to indicate that the container is in the middle of a transition, avoiding duplicate actions on the same container.
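In rough pseudocode, the shape of the change to the progress loop (a sketch, not the literal diff; helper names are assumptions based on the review threads above):

```go
// Inside the managed task's main loop:
for !mtask.steadyState() {
	// Dispatch every transition whose container is not already mid-flight;
	// SetAppliedStatus returning false filters out duplicates.
	transitions := mtask.startContainerTransitions()

	// Before: wait here until ALL entries in transitions completed, so one
	// slow image pull stalled every other container in the task.
	// After: wait for at most one transition (or an acs/docker message),
	// then loop again so any finished container progresses immediately.
	mtask.waitForContainerTransition(transitions, transitionChange, transitionChangeContainer)
}
```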
Testing
- Builds on Linux (make release)
- Builds on Windows (go build -out amazon-ecs-agent.exe ./agent)
- Unit tests on Linux (make test): pass
- Unit tests on Windows (go test -timeout=25s ./agent/...): pass
- Integration tests on Linux (make run-integ-tests): pass
- Integration tests on Windows (.\scripts\run-integ-tests.ps1): pass
- Functional tests on Linux (make run-functional-tests): pass
- Functional tests on Windows (.\scripts\run-functional-tests.ps1): pass
- New tests cover the changes:
Description for the changelog
Licensing
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.