-
Notifications
You must be signed in to change notification settings - Fork 619
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
dockerstate: map keys differ before create #1033
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM but 'RemoveTask" should have a unit test that validates this change.
Yep, that's why this is marked WIP right now. Tests are forthcoming (probably tomorrow). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm, will approve it when its not wip anymore
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 🚀
Please add tests before merging!
I've modified some unit tests and also added a new functional test to ensure we don't regress behavior here. This test fails with the v1.14.5 agent. Passing:
Failing:
|
assert.Equal(t, *testTask.TaskArn, resp.Arn, "arn should be equal") | ||
|
||
// wait two minutes for it to be cleaned up | ||
fmt.Println("Sleeping...") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor: remove this? Or use a logger?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fmt.Println
will print out immediately (when running tests for a single package) while t.Log
buffers until the test ends. I had wanted to print this out during the run so I could inspect the agent logs while it was sleeping. I'm inclined to leave this in right now, but I can remove it if you feel strongly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nah, you can leave this be. It's functional tests and it doesn't matter if log out to stdout here immediately.
delete(state.idToContainer, dockerContainer.DockerID) | ||
// The key to these maps is either the Docker ID or agent-generated name. We use the agent-generated name | ||
// before a Docker ID is available. | ||
key := dockerContainer.DockerID |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it might be safer to move this out of this method into a getKey(container)
method (so that future modifications in this method do not touch that critical piece of business logic). what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code that adds to the map also removes references during a subsequent call. I'll extract both operations out to explicit functions with comments explaining why they're doing this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM minor comemnts
CHANGELOG.md
Outdated
@@ -6,6 +6,9 @@ | |||
[#1014](https://github.com/aws/amazon-ecs-agent/pull/1014) | |||
* Enhancement - Support `init` process in containers by adding support for Docker remote API client version 1.25 | |||
[#996](https://github.com/aws/amazon-ecs-agent/pull/996) | |||
* Bug - Fixed a bug where tasks that fail to pull containers can cause the | |||
agent to fail to restore properly after a restart. | |||
[#1024](https://github.com/aws/amazon-ecs-agent/issues/1024) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this link to the pr?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wrote this before the PR was created. I can update it.
@@ -166,8 +166,12 @@ func (agent *TestAgent) StartAgent() error { | |||
Links: agent.Options.ContainerLinks, | |||
} | |||
|
|||
if os.Getenv("ECS_FTEST_FORCE_NET_HOST") != "" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where is this environment variable being set?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not automatically set. I added this because I was developing the test in an environment where the docker0 bridge did not have Internet access (for unrelated reasons) and wanted the agent to actually be able to run.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then can you add this into the agent/functional_tests/README.d
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. I rewrote a bit more of the README to make it more clear.
The state.idToTask and state.idToContainer maps initially have the keys set to the generated DockerName rather than the DockerID. Once containers are created, the keys are changed to the DockerID. This changed as part of this commit: aws@49c36c7#diff-464fc7e15d1a6b818ceca89e7b68cd4e.
This test verifies that the agent can be restarted after a task with an invalid image is cleaned up. This test fails with v1.14.5 and passes after this commit.
f52cf53
to
fa0662e
Compare
@aaithal @richardpen Can you take a look? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sail boat
The state.idToTask and state.idToContainer maps initially have the keys
set to the generated DockerName rather than the DockerID. Once
containers are created, the keys are changed to the DockerID. This
changed as part of this commit:
49c36c7#diff-464fc7e15d1a6b818ceca89e7b68cd4e.
Summary
Fixes #1024
If a task is run where the image fails to pull, the agent will get into an unrecoverable state after it cleans up the task. This happens because it fails to delete an entry in the
state.idToTask
andstate.idToContainer
maps as it attempts to delete with the wrong key.Implementation details
The state.idToTask and state.idToContainer maps initially have the keys set to the generated DockerName rather than the DockerID. Once containers are created, the keys are changed to the DockerID. This changed as part of this commit: 49c36c7#diff-464fc7e15d1a6b818ceca89e7b68cd4e.
Testing
make release
)go build -out amazon-ecs-agent.exe ./agent
)make test
) passgo test -timeout=25s ./agent/...
) passmake run-integ-tests
) pass.\scripts\run-integ-tests.ps1
) passmake run-functional-tests
) pass.\scripts\run-functional-tests.ps1
) passNew tests cover the changes: yes
Description for the changelog
Bug - Fixed a bug where tasks that fail to pull containers can cause the agent to fail to restore properly after a restart. #1024
Licensing
This contribution is under the terms of the Apache 2.0 License: yes