Ensure task status is reported before cleanup #705

Merged
samuelkarp merged 9 commits into aws:dev from consistent-state-cleanup on Feb 14, 2017

Conversation

samuelkarp
Contributor

@samuelkarp samuelkarp commented Feb 10, 2017

Summary

A rare bug can occur when all of the following conditions hold:

  • Extremely high task launch/task stop rate in the cluster leading to throttles on SubmitTaskStateChange and SubmitContainerStateChange
  • Very low ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION
  • The agent reconnects to ECS after the task has been cleaned up locally but before the ECS API was successfully notified (due to throttles).

This can lead to the agent becoming extremely confused and internal state getting corrupted, possibly leading to even more calls to Submit*StateChange and attempts to pull whatever images are referenced in the corrupted task.

This change forces the agent to wait until the status has been properly submitted before starting cleanup, and aborts cleanup if the task status never gets submitted properly.

Implementation details

  • Gave the api.TaskStateChange and api.ContainerStateChange structs real pointers to the api.Task and api.Container rather than pointers to fields of those respective structs
  • Modified managedTask.cleanupTask to check the SentStatus field and wait if it's not api.TaskStopped
  • Added some unit tests
  • Fixed a bunch of warnings from gometalinter
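
To illustrate the first two bullets, here is a minimal sketch, not the agent's actual code. The Task, TaskStateChange, recordSubmitted and readyForCleanup definitions below are simplified, hypothetical stand-ins. The point is that a state change carrying a pointer to the whole task lets the submission path update SentStatus through locked accessors, which the cleanup path can then check.

```go
package main

import (
	"fmt"
	"sync"
)

// TaskStatus is a simplified stand-in for the agent's task status type.
type TaskStatus int

const (
	TaskStatusNone TaskStatus = iota
	TaskRunning
	TaskStopped
)

// Task is a hypothetical, pared-down task whose SentStatus is guarded by a lock.
type Task struct {
	mu         sync.RWMutex
	Arn        string
	sentStatus TaskStatus
}

// GetSentStatus returns the last status known to have been reported.
func (t *Task) GetSentStatus() TaskStatus {
	t.mu.RLock()
	defer t.mu.RUnlock()
	return t.sentStatus
}

// SetSentStatus records the last status that was reported.
func (t *Task) SetSentStatus(s TaskStatus) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.sentStatus = s
}

// TaskStateChange points at the whole Task rather than at one of its fields,
// so every read and write of SentStatus goes through the locked accessors.
type TaskStateChange struct {
	TaskArn string
	Status  TaskStatus
	Task    *Task
}

// recordSubmitted is what a (hypothetical) submitter would call after the
// API accepted the state change.
func recordSubmitted(change TaskStateChange) {
	if change.Task != nil && change.Task.GetSentStatus() < change.Status {
		change.Task.SetSentStatus(change.Status)
	}
}

// readyForCleanup reports whether the stopped status has been reported,
// which is the condition cleanup waits for in this sketch.
func readyForCleanup(t *Task) bool {
	return t.GetSentStatus() == TaskStopped
}

func main() {
	task := &Task{Arn: "arn:example"}
	fmt.Println("ready before submission:", readyForCleanup(task))
	recordSubmitted(TaskStateChange{TaskArn: task.Arn, Status: TaskStopped, Task: task})
	fmt.Println("ready after submission:", readyForCleanup(task))
}
```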

Testing

  • Builds on Linux (make release)
  • Builds on Windows (go build -o amazon-ecs-agent.exe ./agent)
  • Unit tests on Linux (make test) pass
  • Unit tests on Windows (go test -timeout=25s ./agent/...) pass
  • Integration tests on Linux (make run-integ-tests) pass
  • Integration tests on Windows (.\scripts\run-integ-tests.ps1) pass
  • Functional tests on Linux (make run-functional-tests) pass
  • Functional tests on Windows (.\scripts\run-functional-tests.ps1) pass
  • Created high load on an instance in a cluster with artificially low throttles

New tests cover the changes: yes

Description for the changelog

Bug - Fixed a bug where throttles on state change reporting could lead to corrupted state

Licensing

This contribution is under the terms of the Apache 2.0 License: yes (Amazon employee)

@samuelkarp samuelkarp added this to the 1.14.1 milestone Feb 10, 2017
@@ -28,10 +28,10 @@ import (
 )

 func contEvent(arn string) api.ContainerStateChange {
-	return api.ContainerStateChange{TaskArn: arn, ContainerName: "containerName", Status: api.ContainerRunning}
+	return api.ContainerStateChange{TaskArn: arn, ContainerName: "containerName", Status: api.ContainerRunning, Container: &api.Container{}}
Contributor

Why is this an empty struct and not nil?

Also, a nit: Breaking this across multiple lines would be even nicer.

Contributor Author

Container.GetSentStatus() and Container.SetSentStatus() are called without nil-checks.
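
As a rough illustration of why an empty struct is safer than nil here, the sketch below uses a simplified, hypothetical Container type, not the agent's real one. A pointer-receiver accessor that takes a lock without a nil check panics on a nil pointer but works on an empty value.

```go
package main

import (
	"fmt"
	"sync"
)

// ContainerStatus is a simplified stand-in for the agent's status type.
type ContainerStatus int

// Container is a hypothetical, pared-down container with a locked SentStatus.
type Container struct {
	mu         sync.RWMutex
	sentStatus ContainerStatus
}

// GetSentStatus has no nil check, so calling it on a nil *Container
// dereferences nil when it reaches for the mutex.
func (c *Container) GetSentStatus() ContainerStatus {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.sentStatus
}

// panics reports whether calling f results in a panic.
func panics(f func()) (panicked bool) {
	defer func() { panicked = recover() != nil }()
	f()
	return false
}

func main() {
	var missing *Container // what a nil Container field would hold
	empty := &Container{}  // what the test helper passes instead

	fmt.Println("nil container panics:", panics(func() { missing.GetSentStatus() }))
	fmt.Println("empty container panics:", panics(func() { empty.GetSentStatus() }))
}
```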

for !mtask.waitEvent(cleanupTimeBool) {
}
stoppedSentBool := make(chan bool)
go func() {
Contributor

@aaithal aaithal Feb 10, 2017

Can this be broken out into a named method?

Contributor Author

Done

Contributor

aaithal commented Feb 10, 2017

Looks super neat overall. Words can't express my joy about the TaskEngineState interface and locks around SentStatus. I have some minor comments/questions. Also, can you ensure that all the edited files have the copyright year 2017 in them? I think you might have missed some.

@samuelkarp samuelkarp force-pushed the consistent-state-cleanup branch from db4f950 to 1db71f5 Compare February 10, 2017 20:29
Contributor Author

@aaithal I've updated the copyright on all the files I touched.

Contributor

aaithal commented Feb 10, 2017

@samuelkarp the new commit lgtm.

Prior to this change, a race condition exists between reporting task
status and task cleanup in the agent.  If reporting task status took an
excessively long time and ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION is set
very short, the containers and associated task metadata could be removed
before the ECS backend is informed that the task has stopped.  When the
task is especially short-lived, it's also possible that the cleanup could
occur before the ECS backend is even informed that the task is running.
In this situation, it's possible for the ECS backend to re-transmit the
task to the agent (assuming that it hadn't started yet) and the
de-duplication logic in the agent can break.

With this change, we ensure that the task status is reported prior to
cleanup actually starting.
@samuelkarp samuelkarp force-pushed the consistent-state-cleanup branch from 1fdca9a to 61371b4 Compare February 14, 2017 01:02
@samuelkarp samuelkarp merged commit 61371b4 into aws:dev Feb 14, 2017
for !mtask.waitEvent(cleanupTimeBool) {
}
stoppedSentBool := make(chan bool)
go func() {
Contributor

For newly written goroutines, should we start enforcing the rule of always passing a Context, so that they support simple cancellation?

Contributor Author

There's no context wired in to most of our codebase as most of it was written before context existed. We could add that to the managedTask but that's outside the scope of my change here.

I don't think that always passing context is a hard rule that we should enforce. I think that whether or not we should use context depends on what the code is doing.

Contributor

Without a context, there is no way to stop this potentially long-running new goroutine (which could run for up to 72 hours).

If there is another kind of state mismatch between the agent and the backend, where the backend thinks this instance can launch new tasks but the agent is holding on to these long-running cleanup goroutines, is it possible that the agent could run out of memory?

Contributor

I meant that if the backend service keeps starting new tasks and these tasks get stuck in the cleanup state for 72 hours, will the agent eventually run out of memory?

Contributor Author

This is the desired behavior. A successful submission of task state will result in this goroutine exiting. Unsuccessful submissions will delay cleanup until success or the 72-hour timeout, whichever is sooner. There is no use-case to stop the goroutine other than this.

Contributor

@petderek petderek left a comment

Neat. I had a few mostly curious questions — I don't think anything needs to change now.

// express or implied. See the License for the specific language governing
// permissions and limitations under the License.

package dockerstate
Contributor

Naming? It may just be me, but when I saw this referenced in other bits of code, I assumed it was part of Docker's library.

Contributor Author

Yes, I agree that this name is bad. I didn't want to change it as part of this PR though.


 // Allow Task cleanup to occur
-time.Sleep(2 * time.Second)
+time.Sleep(5 * time.Second)
Contributor

Why time.Sleep? Is there an event we can listen for instead of setting an arbitrary time?

Is there a best practice around testing these sorts of interactions in Go?

Contributor Author

Event-based would be better, but we'd still want a timeout on the event. Part of what the tests here do is ensure that cleanup happens within the expected time.

break
}
seelog.Warnf("Blocking cleanup for task %v until the task has been reported stopped. SentStatus: %v (%d/%d)", mtask, sentStatus, i+1, _maxStoppedWaitTimes)
mtask._time.Sleep(_stoppedSentWaitInterval)
Contributor

  • Why are you using _time as a member of mtask instead of just using time.Sleep directly?
  • Why not use exponential backoff with the same 72 hour cap?

Contributor Author

  • _time is a ttime.Time implementation that can be swapped out for tests. In unit tests, we inject a mock here that lets us verify and control how the code interacts with time.
  • I didn't feel like exponential backoff was necessary or particularly desirable here; the only thing that this does is check the value of a variable. Constant delay is also a bit easier to read.

@adnxn adnxn mentioned this pull request Mar 6, 2017

5 participants