Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Streamline how tasks stopped per ECS Control Plane #4301

Merged

Conversation

danehlim
Copy link
Contributor

@danehlim danehlim commented Aug 19, 2024

Summary

Streamline how tasks are stopped when ECS Control Plane indicates that a task should be stopped.

Currently ECS Control Plane indicates to ECS Data Plane via Agent Communication Service (ACS) that a task should be stopped in 2 ways:

  1. via a payload message
  2. via a task stop verification ACK

However, currently the workflow invoked/set of actions taken to actually stop a task on ECS Data Plane side differs for these 2 cases. We should streamline this to allow for consistency and better maintainability of code (e.g., not having to make changes in multiple places in the future).

Implementation details

  • Task engine AddTask method is currently overloaded. Factor out updating task in task engine logic into a separate method UpsertTask
    • Update testing mocks to factor in new task engine UpsertTask method
  • Refactor task engine updateTaskUnsafe method to:
    • take in as its 2nd argument the desired status to update the task to, instead of an entire update task. Taking in an entire update task is unnecessary since only the desired status of the update task is used
    • be renamed to updateTaskDesiredStatusUnsafe to be more specific/clear
  • When stopping a task in response to a task stop verification ACK received from ACS, replace logic to call task methods SetDesiredStatus and UpdateDesiredStatus on the task directly with calling task engine UpsertTask method instead
  • Remove integration test overrides of task engine taskSteadyStatePollInterval and taskSteadyStatePollIntervalJitter values as they are no longer necessary. This is because using UpsertTask to stop the task ensures that the ACS transition to stop the task is emitted on the task's acsMessages channel, and thus these values no longer affect how long TestTaskStopVerificationACKResponder* integration tests take to run (see discussion on previous pull request comment thread here for additional context)

Testing

Automated pull request tests.

Manually test using a custom ECS Agent built with the changes in this pull request against the internal bug repro environment mentioned in #4240 to ensure that the bug addressed by that pull request is still addressed with the changes in this pull request. With this custom ECS Agent, the aforementioned bug is still not observed AND we observe that the transition to stop the task is put on the task's ACS messages channel and stopping the container happens immediately after the transition is applied.

Partial manual testing ECS Agent logs {

"level": "debug",
"time": "2024-08-20T17:06:44Z",
"msg": "Received message of type: TaskStopVerificationAck"
}

{
"level": "debug",
"time": "2024-08-20T17:06:44Z",
"msg": "ACS activity occurred"
}

{
"level": "debug",
"time": "2024-08-20T17:06:44Z",
"msg": "Handling TaskStopVerificationACKMessage"
}

{
"level": "info",
"time": "2024-08-20T17:06:44Z",
"msg": "Sending message to task stopper to stop task",
"taskARN": "arn:aws:ecs:us-west-2:REDACTED:task/my-cluster/b176d749a3844107ab3823af0b34c18b"
}

{
"level": "info",
"time": "2024-08-20T17:06:44Z",
"msg": "Stopping task from task stop verification ACK: %s",
"taskARN": "arn:aws:ecs:us-west-2:REDACTED:task/my-cluster/b176d749a3844107ab3823af0b34c18b"
}

{
"level": "debug",
"time": "2024-08-20T17:06:44Z",
"msg": "Putting update on the acs channel",
"desiredStatus": "STOPPED",
"task": "b176d749a3844107ab3823af0b34c18b"
}

{
"level": "debug",
"time": "2024-08-20T17:06:44Z",
"msg": "Update taken off the acs channel",
"desiredStatus": "STOPPED",
"task": "b176d749a3844107ab3823af0b34c18b"
}

{
"level": "info",
"time": "2024-08-20T17:06:44Z",
"msg": "Managed task got acs event",
"desiredStatus": "STOPPED",
"seqnum": 0,
"task": "b176d749a3844107ab3823af0b34c18b"
}

{
"level": "info",
"time": "2024-08-20T17:06:44Z",
"msg": "New acs transition",
"desiredStatus": "STOPPED",
"seqnum": 0,
"task": "b176d749a3844107ab3823af0b34c18b"
}

{
"level": "info",
"time": "2024-08-20T17:06:44Z",
"msg": "Sleeping 45 seconds before applying acs transition (THIS IS ONLY DONE FOR INTERNAL BUG REPRO ENIVRONMENT)"
}

...
{
"level": "debug",
"time": "2024-08-20T17:07:29Z",
"msg": "Updating task's desired status",
"nContainers": 1,
"nENIs": 0,
"taskArn": "arn:aws:ecs:us-west-2:REDACTED:task/my-cluster/b176d749a3844107ab3823af0b34c18b",
"taskDesiredStatus": "STOPPED",
"taskFamily": "test-sleep",
"taskKnownStatus": "RUNNING",
"taskVersion": "3"
}

...
{
"level": "debug",
"time": "2024-08-20T17:07:29Z",
"msg": "Waiting for task event",
"task": "b176d749a3844107ab3823af0b34c18b"
}

{
"level": "info",
"time": "2024-08-20T17:07:29Z",
"msg": "Managed task got acs event",
"desiredStatus": "STOPPED",
"seqnum": 0,
"task": "b176d749a3844107ab3823af0b34c18b"
}

{
"level": "info",
"time": "2024-08-20T17:07:29Z",
"msg": "New acs transition",
"desiredStatus": "STOPPED",
"seqnum": 0,
"task": "b176d749a3844107ab3823af0b34c18b"
}

...
{
"level": "debug",
"time": "2024-08-20T17:07:29Z",
"msg": "Update taken off the acs channel",
"desiredStatus": "STOPPED",
"task": "b176d749a3844107ab3823af0b34c18b"
}

{
"level": "info",
"time": "2024-08-20T17:07:29Z",
"msg": "Stopping container",
"container": "sleepy300",
"task": "b176d749a3844107ab3823af0b34c18b"
}

New tests cover the changes: existing unit and integration tests updated

Description for the changelog

Streamline how tasks stopped per ECS Control Plane

Additional Information

Does this PR include breaking model changes? If so, Have you added transformation functions?

No

Does this PR include the addition of new environment variables in the README?

No

Licensing

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@danehlim danehlim marked this pull request as ready for review August 20, 2024 00:04
@danehlim danehlim requested a review from a team as a code owner August 20, 2024 00:04
tinnywang
tinnywang previously approved these changes Aug 20, 2024
agent/engine/docker_task_engine.go Outdated Show resolved Hide resolved
agent/acs/session/task_stop_verification_ack_responder.go Outdated Show resolved Hide resolved
agent/engine/docker_task_engine.go Outdated Show resolved Hide resolved
agent/engine/interface.go Outdated Show resolved Hide resolved
@danehlim danehlim force-pushed the streamline-stopping-task-per-ecs-control-plane branch from ef068e9 to 1b9955d Compare August 23, 2024 23:43
amogh09
amogh09 previously approved these changes Aug 23, 2024
@danehlim danehlim merged commit d6e4c7a into aws:dev Aug 24, 2024
40 checks passed
@danehlim danehlim mentioned this pull request Aug 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants