Ensure task status is reported before cleanup #705
Conversation
@@ -28,10 +28,10 @@ import (
)

func contEvent(arn string) api.ContainerStateChange {
-	return api.ContainerStateChange{TaskArn: arn, ContainerName: "containerName", Status: api.ContainerRunning}
+	return api.ContainerStateChange{TaskArn: arn, ContainerName: "containerName", Status: api.ContainerRunning, Container: &api.Container{}}
Why is this an empty struct and not nil?
Also, a nit: breaking this across multiple lines would be even nicer.
Container.GetSentStatus() and Container.SetSentStatus() are called without nil-checks.
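For illustration, here is a minimal sketch (with simplified stand-in types, not the agent's actual definitions) of why the handler needs a non-nil Container: methods such as GetSentStatus dereference the receiver, so a nil *api.Container would panic, while an empty struct is safe to call methods on.

```go
package main

import (
	"fmt"
	"sync"
)

// ContainerStatus and Container are simplified stand-ins for the agent's types.
type ContainerStatus int32

type Container struct {
	sentStatusLock sync.RWMutex
	SentStatus     ContainerStatus
}

// GetSentStatus locks the struct's mutex, so it cannot be called on a nil receiver.
func (c *Container) GetSentStatus() ContainerStatus {
	c.sentStatusLock.RLock()
	defer c.sentStatusLock.RUnlock()
	return c.SentStatus
}

func main() {
	ok := &Container{} // an empty struct's zero-value mutex is usable
	fmt.Println("empty struct:", ok.GetSentStatus())

	defer func() {
		if r := recover(); r != nil {
			fmt.Println("recovered from:", r) // nil pointer dereference
		}
	}()
	var missing *Container
	fmt.Println(missing.GetSentStatus()) // panics inside GetSentStatus
}
```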
agent/engine/task_manager.go
Outdated
for !mtask.waitEvent(cleanupTimeBool) {
}
stoppedSentBool := make(chan bool)
go func() {
Can this be broken out into a named method?
Done
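As a rough sketch of the refactor being asked for (the method name and fields below are invented for illustration, not the agent's actual code), the anonymous goroutine body becomes a named method on managedTask:

```go
package main

import (
	"fmt"
	"time"
)

// managedTask is a hypothetical stand-in, just to show the shape of the refactor.
type managedTask struct {
	arn string
}

// waitForStopReported is the named method; the real implementation polls
// SentStatus, but here it only simulates the wait and signals completion.
func (mtask *managedTask) waitForStopReported(done chan<- bool) {
	time.Sleep(10 * time.Millisecond) // placeholder for the real polling loop
	done <- true
}

func main() {
	mtask := &managedTask{arn: "arn:aws:ecs:example"}
	stoppedSentBool := make(chan bool)
	go mtask.waitForStopReported(stoppedSentBool) // instead of `go func() { ... }()`
	<-stoppedSentBool
	fmt.Println("cleanup can proceed for", mtask.arn)
}
```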
Looks super neat overall. Words can't express my joy about the TaskEngineState interface and locks around
@aaithal I've updated the copyright on all the files I touched.
@samuelkarp the new commit lgtm.
Prior to this change, a race condition exists between reporting task status and task cleanup in the agent. If reporting task status took an excessively long time and ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION is set very short, the containers and associated task metadata could be removed before the ECS backend is informed that the task has stopped. When the task is especially short-lived, it's also possible that the cleanup could occur before the ECS backend is even informed that the task is running. In this situation, it's possible for the ECS backend to re-transmit the task to the agent (assuming that it hadn't started yet) and the de-duplication logic in the agent can break. With this change, we ensure that the task status is reported prior to cleanup actually starting.
agent/engine/task_manager.go
Outdated
for !mtask.waitEvent(cleanupTimeBool) {
}
stoppedSentBool := make(chan bool)
go func() {
For the newly written goroutine, shall we start enforcing the rule of always passing a Context to it, so that it can provide simple cancellation?
There's no context wired in to most of our codebase, as most of it was written before context existed. We could add that to the managedTask, but that's outside the scope of my change here.
I don't think that always passing context is a hard rule that we should enforce. I think that whether or not we should use context depends on what the code is doing.
Without context, there is no way to stop this potentially long-running new goroutine (which could run for, at worst, 72 hours).
If there is another kind of state mismatch between the agent and the backend, the backend may think this instance is able to launch new tasks while the agent is holding onto those long-running cleanup goroutines. Is it possible that the agent could run out of memory?
I meant that the backend service keeps starting new tasks, and those tasks get stuck in the cleanup state for 72 hours; eventually, will the agent run out of memory?
This is the desired behavior. A successful submission of task state will result in this goroutine exiting. Unsuccessful submissions will delay until success or the timeout or 72 hours, whichever is sooner. There is no use-case to stop the goroutine other than this.
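For reference, a minimal sketch of the pattern being discussed (illustrative names, not the agent's code): a wait goroutine that accepts a context.Context so a caller could cancel it instead of letting it run for up to 72 hours.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// waitForStopReported polls reported() until it returns true, the context is
// cancelled, or the context's deadline expires. All names are hypothetical.
func waitForStopReported(ctx context.Context, reported func() bool, interval time.Duration) bool {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return false // caller cancelled or deadline hit; stop waiting
		case <-ticker.C:
			if reported() {
				return true
			}
		}
	}
}

func main() {
	// A 72-hour cap, mirroring the worst case mentioned above.
	ctx, cancel := context.WithTimeout(context.Background(), 72*time.Hour)
	defer cancel()

	start := time.Now()
	ok := waitForStopReported(ctx, func() bool {
		return time.Since(start) > 50*time.Millisecond // pretend submission succeeds after ~50ms
	}, 10*time.Millisecond)
	fmt.Println("reported stopped:", ok)
}
```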
Neat. I had a few mostly curious questions — I don't think anything needs to change now.
// express or implied. See the License for the specific language governing
// permissions and limitations under the License.

package dockerstate
Naming? It may just be me, but when I saw this referenced in other bits of code I assumed it was part of Docker's library.
Yes, I agree that this name is bad. I didn't want to change it as part of this PR though.
// Allow Task cleanup to occur
-	time.Sleep(2 * time.Second)
+	time.Sleep(5 * time.Second)
Why time.Sleep? Is there an event we can listen for instead of setting an arbitrary time?
Is there a best practice around testing these sorts of interactions in Go?
Event-based would be better, but we'd actually still want a timeout on the event. Part of the tests here is ensuring that cleanup happens within the expected time.
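A small sketch of the event-plus-timeout idea (the cleanupDone channel is hypothetical; the actual test keeps the time.Sleep):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	cleanupDone := make(chan struct{})

	// Stand-in for the code under test signalling that cleanup finished.
	go func() {
		time.Sleep(100 * time.Millisecond)
		close(cleanupDone)
	}()

	select {
	case <-cleanupDone:
		fmt.Println("cleanup happened in time")
	case <-time.After(5 * time.Second): // still bounded, as the reply points out
		fmt.Println("cleanup did not happen within the expected time")
	}
}
```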
break
}
seelog.Warnf("Blocking cleanup for task %v until the task has been reported stopped. SentStatus: %v (%d/%d)", mtask, sentStatus, i+1, _maxStoppedWaitTimes)
mtask._time.Sleep(_stoppedSentWaitInterval)
- Why are you using _time as a member of mtask instead of just using time.Sleep directly?
- Why not use exponential backoff with the same 72 hour cap?
- _time is a time implementation that can be swapped out for tests. In unit tests, we inject a mock here that lets us verify and control the interactions that the code is doing with respect to time (see the sketch below).
- I didn't feel like exponential backoff was necessary or particularly desirable here; the only thing that this does is check the value of a variable. Constant delay is also a bit easier to read.
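As a sketch of the swappable-time idea described above (the interface and helper names are illustrative, not the agent's actual _time type): production code sleeps for real, while a unit test injects a fake that records the calls and returns immediately.

```go
package main

import (
	"fmt"
	"time"
)

// Time is the seam the code depends on instead of calling time.Sleep directly.
type Time interface {
	Sleep(d time.Duration)
}

// realTime is what production code would use.
type realTime struct{}

func (realTime) Sleep(d time.Duration) { time.Sleep(d) }

// fakeTime is what a unit test would inject to verify and control sleeps.
type fakeTime struct{ slept []time.Duration }

func (f *fakeTime) Sleep(d time.Duration) { f.slept = append(f.slept, d) }

// waitForSentStatus polls sent() with a constant delay between checks.
func waitForSentStatus(t Time, sent func() bool, interval time.Duration, maxTries int) bool {
	for i := 0; i < maxTries; i++ {
		if sent() {
			return true
		}
		t.Sleep(interval) // constant delay, as chosen in the reply above
	}
	return false
}

func main() {
	ft := &fakeTime{}
	ok := waitForSentStatus(ft, func() bool { return len(ft.slept) >= 3 }, time.Minute, 10)
	fmt.Println(ok, "after", len(ft.slept), "recorded sleeps") // true after 3 recorded sleeps
}
```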
Summary
There's a rare bug that can occur when the following situation happens: calls to SubmitTaskStateChange and SubmitContainerStateChange are delayed long enough that task cleanup starts before the task's status has been reported.
This can lead to the agent becoming extremely confused and internal state getting corrupted, possibly leading to even more calls to Submit*StateChange and attempts to pull whatever images are referenced in the corrupted task. This change attempts to force the agent to wait until the status has been properly submitted before starting cleanup, and to abort cleanup if the task status is never submitted successfully.
Implementation details
- Give the api.TaskStateChange and api.ContainerStateChange structs real pointers to the api.Task and api.Container, rather than pointers to fields of those respective structs
- Change managedTask.cleanupTask to check the SentStatus field and wait if it's not api.TaskStopped (a rough sketch of this follows below)
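Pieced together from the snippets quoted earlier in this review, here is a condensed, self-contained sketch of that cleanup-blocking behavior (constant values and helper names are illustrative, not copied from the agent):

```go
package main

import (
	"fmt"
	"time"
)

type TaskStatus int

const (
	TaskRunning TaskStatus = iota
	TaskStopped
)

// Illustrative values; the agent's real interval and retry budget differ.
const (
	stoppedSentWaitInterval = 10 * time.Millisecond
	maxStoppedWaitTimes     = 5
)

// blockUntilStoppedReported returns true only once the task's stopped status
// has been reported to the backend, or false if the retry budget runs out.
func blockUntilStoppedReported(getSentStatus func() TaskStatus, sleep func(time.Duration)) bool {
	for i := 0; i < maxStoppedWaitTimes; i++ {
		if getSentStatus() >= TaskStopped {
			return true
		}
		fmt.Printf("Blocking cleanup until the task has been reported stopped (%d/%d)\n", i+1, maxStoppedWaitTimes)
		sleep(stoppedSentWaitInterval)
	}
	return false
}

func main() {
	checks := 0
	getStatus := func() TaskStatus {
		checks++
		if checks > 3 {
			return TaskStopped // simulate SubmitTaskStateChange eventually succeeding
		}
		return TaskRunning
	}

	if blockUntilStoppedReported(getStatus, time.Sleep) {
		fmt.Println("safe to clean up task data")
	} else {
		fmt.Println("aborting cleanup: stopped status was never reported")
	}
}
```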
Testing
- (make release)
- (go build -out amazon-ecs-agent.exe ./agent)
- (make test) pass
- (go test -timeout=25s ./agent/...) pass
- (make run-integ-tests) pass
- (.\scripts\run-integ-tests.ps1) pass
- (make run-functional-tests) pass
- (.\scripts\run-functional-tests.ps1) pass

New tests cover the changes: yes
Description for the changelog
Bug - Fixed a bug where throttles on state change reporting could lead to corrupted state
Licensing
This contribution is under the terms of the Apache 2.0 License: yes (Amazon employee)