-
Notifications
You must be signed in to change notification settings - Fork 619
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix concurrent writes during image cleanup #743
Conversation
@@ -951,3 +953,68 @@ func TestGetImageStateFromImageNameNoImageState(t *testing.T) { | |||
t.Error("Incorrect image state retrieved by image name") | |||
} | |||
} | |||
|
|||
func TestConcurrentRemoveUnusedImages(t *testing.T) { | |||
// NOTE: Test would fail without the corresponding fix |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's better to put this line into the description of PR.
} | ||
require.Equal(t, 1, len(imageManager.imageStates)) | ||
|
||
err = imageManager.RemoveContainerReferenceFromImageState(container) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a comment here to explain why this is needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is part of the standard setup in tests of similar nature here.
They help trigger the use-case under test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vsiddharth Please add a comment in code to describe the same
} | ||
|
||
imageState, _ := imageManager.getImageState(imageInspected.ID) | ||
imageState.RemoveImageName(container.Image) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also why this line is required?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can get rid of this line by changing client.EXPECT().RemoveImage(sourceImage.ImageID, removeImageTimeout).Return(nil)
to client.EXPECT().RemoveImage(container.Image, removeImageTimeout).Return(nil)
. Because image manager will try to delete the image by name first, if there is no name then it will delete by id.
agent/engine/docker_image_manager.go
Outdated
@@ -277,6 +275,8 @@ func (imageManager *dockerImageManager) performPeriodicImageCleanup(ctx context. | |||
} | |||
|
|||
func (imageManager *dockerImageManager) removeUnusedImages() { | |||
imageManager.updateLock.Lock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you document if you've observed any slowness because of this in task launch latencies on an instance where there are at least 5 images to clean?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did not observe any significant latencies
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Github does not let me add comments on code that needs to modified outside this PR, so commenting here instead. It'd be a lot better if we could add some logging here because this method acquires at least 3 locks during its lifetime. Specifically, a log line to indicate that we are cleaning all tracking information for the image name because it has 0 references to it at line num 351 (after the if
loop)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this mostly looks good, though it's hard to see the codepath in the diff (I had to open an editor and trace through it). Please address @richardpen's comment as well as mine.
require.Equal(t, 1, len(imageManager.imageStates)) | ||
|
||
err = imageManager.RemoveContainerReferenceFromImageState(container) | ||
if err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you change this to assert.NoError
?
client.EXPECT().RemoveImage(container.Image, removeImageTimeout).Return(nil) | ||
require.Equal(t, 1, len(imageManager.imageStates)) | ||
|
||
numRoutines := 1000 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any particular reason you picked 1000?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There isn't a significant reason for this choice. Its something used to recreate the problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a comment to that effect. Right now 1000 is just a magic number in the source code.
@vsiddharth Do you plan to address the comments here? @richardpen still has one requested change and both @aaithal and I have asked questions. |
@richardpen @aaithal @samuelkarp I've responded to most of the comments a few days back. Please let me know if you have any further questions. Thanks |
agent/engine/docker_image_manager.go
Outdated
@@ -277,6 +275,8 @@ func (imageManager *dockerImageManager) performPeriodicImageCleanup(ctx context. | |||
} | |||
|
|||
func (imageManager *dockerImageManager) removeUnusedImages() { | |||
imageManager.updateLock.Lock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Github does not let me add comments on code that needs to modified outside this PR, so commenting here instead. It'd be a lot better if we could add some logging here because this method acquires at least 3 locks during its lifetime. Specifically, a log line to indicate that we are cleaning all tracking information for the image name because it has 0 references to it at line num 351 (after the if
loop)
agent/engine/docker_image_manager.go
Outdated
@@ -348,6 +349,7 @@ func (imageManager *dockerImageManager) deleteImage(imageID string, imageState * | |||
seelog.Infof("Image removed: %v", imageID) | |||
imageState.RemoveImageName(imageID) | |||
if len(imageState.Image.Names) == 0 { | |||
seelog.Infof("Cleaning up all tracking information for image %v as it has zero references", imageID) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Please change %v
to %s
before merging this.
This patch inspects the cleanup and resolves the inherent concurrent map write issue reported. An unit test has been added to increase confidence in the fix. Fixes aws#707 Signed-off-by: Vinothkumar Siddharth <[email protected]>
ac820c5
to
b019dfb
Compare
@samuelkarp Could you have another look at this PR ? |
@vsiddharth What's failing in the integration tests on Windows? |
@samuelkarp I've observed the windows failures and that is due to a flaky test that we already track. |
@vsiddharth which test is it exactly? |
The test that failed was |
Merged with this commit |
This patch inspects the cleanup and resolves the inherent concurrent map write
issue reported. An unit test has been added to increase confidence in the fix.
Fixes #707
Signed-off-by: Vinothkumar Siddharth [email protected]
Summary
Implementation details
Testing
make release
)go build -out amazon-ecs-agent.exe ./agent
)make test
) passgo test -timeout=25s ./agent/...
) passmake run-integ-tests
) pass.\scripts\run-integ-tests.ps1
) passmake run-functional-tests
) pass.\scripts\run-functional-tests.ps1
) passNew tests cover the changes:
Description for the changelog
Licensing
This contribution is under the terms of the Apache 2.0 License: