engine: introduce a new env var to distinct pull image behavior #1348
CHANGELOG.md
Outdated
@@ -1,5 +1,8 @@
# Changelog

## 1.18.0

The convention is to use `1.18.0-dev` before release.
ok.
agent/engine/task_manager.go
Outdated
mtask.cfg.AgentPullBehavior == config.OnceAgentPullBehavior {
seelog.Criticalf("Managed task [%s]: Error while pulling container %s, task will fail: %v",
mtask.Arn, container.Name, event.Error)
mtask.SetDesiredStatus(api.TaskStopped)

Here, the task will be moved to STOPPED even if the container is not essential. Is that what we are implying by using the config variable? If yes, we might want to document that.
Essential or non-essential is only effective for `RUNNING` containers. It shouldn't have a bearing on the `Pulled` transition failing.
The task will stop for both essential and non-essential containers. I will add some comments for that.
Today, when a container fails with desired `CREATED`, we move the task to `STOPPED` only if it is an essential container. We should do the same for `PULLED` too.
If not, we might want to change the logic to stop the task for failures during `CREATED`, irrespective of the essential status of containers.
@haikuoliu can you please fix the failing tests?
@aaithal It's the timeout test. We talked about this in the wire context PR #1329: to fix it, either use a mock context (implemented in the initial PR), use a cancel context to simulate a timeout context (in one of the revised PRs), or don't fix it at all (in the final PR). We chose the last option.
That doesn't seem right. I think you 'fix' it by adding
CHANGELOG.md
Outdated
@@ -1,5 +1,8 @@
# Changelog

## 1.18.0
* Enhancement - Introduce a new environment variable `ECS_AGENT_PULL_BEHAVIOR` to make agent pull behavior configurable [#to be added](to be added)

You can actually add the `to be added` info now. Also, this falls under the `Feature` bucket, not `Enhancement`.
OK
README.md
Outdated
@@ -169,6 +169,7 @@ additional details on each available environment variable.
| `ECS_IMAGE_CLEANUP_INTERVAL` | 30m | The time interval between automated image cleanup cycles. If set to less than 10 minutes, the value is ignored. | 30m | 30m |
| `ECS_IMAGE_MINIMUM_CLEANUP_AGE` | 30m | The minimum time interval between when an image is pulled and when it can be considered for automated image cleanup. | 1h | 1h |
| `ECS_NUM_IMAGES_DELETE_PER_CYCLE` | 5 | The maximum number of images to delete in a single automated image cleanup cycle. If set to less than 1, the value is ignored. | 5 | 5 |
| `ECS_AGENT_PULL_BEHAVIOR` | <default | always | never | once> | Behavior used to customize the pull image process. default: pull image remotely, use cached image if the pull fails; always: only pull image remotely, the task fails when the pull fails; never: only use cached image; once: pull image remotely only if this image has never been pulled before, use cached image otherwise. | default | default |

Can you please get @nrdlngr to review this?
OK, I will ask him for a review.
agent/config/config.go
Outdated
@@ -86,6 +86,23 @@ const (
// performing image cleanup.
minimumNumImagesToDeletePerCycle = 1

// DefaultAgentPullBehavior specifies the behavior that if an image pull API call fails,
// agent tries to start from the Docker image cache anyway, assuming that the image has not changed.
DefaultAgentPullBehavior AgentPullBehaviorType = 0

Please move this into its own const block and use `iota` to be more idiomatic: https://golang.org/ref/spec#Iota
OK, sure.
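The const-block suggestion above could look roughly like the following. This is only a sketch: the type name `PullBehavior`, the constant names, and the `String` method are hypothetical and were chosen for illustration; only the four values (default, always, never, once) come from the README row in this PR.

```go
package main

import "fmt"

// PullBehavior is a hypothetical enum type used for illustration only.
type PullBehavior int8

// Keeping the values in their own const block lets iota number them 0..3.
const (
	// PullDefault is the zero value, so an unset field falls back to it.
	PullDefault PullBehavior = iota
	PullAlways
	PullNever
	PullOnce
)

// String returns a readable name, which helps in log messages.
func (b PullBehavior) String() string {
	switch b {
	case PullDefault:
		return "default"
	case PullAlways:
		return "always"
	case PullNever:
		return "never"
	case PullOnce:
		return "once"
	}
	return "unknown"
}

func main() {
	fmt.Println(PullDefault, PullAlways, PullNever, PullOnce)
}
```

Because the first constant is the zero value, a config struct that never sets the field behaves as if the default behavior had been chosen.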
agent/config/types.go
Outdated
@@ -20,6 +20,8 @@ import (
cnitypes "github.com/containernetworking/cni/pkg/types"
)

type AgentPullBehaviorType int8

This needs a comment.
OK
agent/engine/docker_task_engine.go
Outdated
default:
// If the agent pull behavior is never, we don't try to pull the image,
// try to use local image cache instead.
if engine.cfg.AgentPullBehavior == config.NeverAgentPullBehavior {

Can you please refactor these into a method? Maybe something that returns a boolean to indicate whether you should return `dockerapi.DockerContainerMetadata{Error: nil}`?
OK, will abstract it out.
agent/engine/image/types.go
Outdated
Containers []*api.Container `json:"-"`
PulledAt time.Time
LastUsedAt time.Time
IsPulledSuccess bool

`PullSucceeded` seems like a better name for this. Also, please add documentation for this and all other fields.
OK, sure.
@haikuoliu you also need to update the version and documentation for the agent's state file in
@aaithal My bad, thanks for pointing that out. I will fix it. I will also update the version and documentation for state_manager.go.
Update: fixed the timeout test, added more comments, updated the changelog, and did some other refactoring. Still waiting for review of README.md.
CHANGELOG.md
Outdated
@@ -1,5 +1,8 @@
# Changelog

## 1.18.0-dev
* Feature - Introduce a new environment variable `ECS_AGENT_PULL_BEHAVIOR` to make agent pull behavior configurable [#1348](https://github.com/aws/amazon-ecs-agent/pull/1348)

`Feature - Configurable agent pull behavior [#1348]..` seems like a better description for this.
OK
agent/config/config.go
Outdated
@@ -325,6 +344,7 @@ func environmentConfig() (Config, error) {
imageCleanupInterval := parseEnvVariableDuration("ECS_IMAGE_CLEANUP_INTERVAL")
numImagesToDeletePerCycleEnvVal := os.Getenv("ECS_NUM_IMAGES_DELETE_PER_CYCLE")
numImagesToDeletePerCycle, err := strconv.Atoi(numImagesToDeletePerCycleEnvVal)
agentPullBehavior := getAgentPullBehavior()

You don't need this variable. Just set `AgentPullBehavior` to `getAgentPullBehavior()`.
I wanted to be consistent with the other code. I just saw your config refactor commit; I will change it accordingly.
agent/config/config.go
Outdated
@@ -455,6 +476,24 @@ func getContainerStartTimeout() time.Duration {
return containerStartTimeout
}

func getAgentPullBehavior() AgentPullBehaviorType {
var agentPullBehavior AgentPullBehaviorType

You can get rid of this variable as well:
func getAgentPullBehavior() AgentPullBehaviorType {
switch agentPullBehaviorString {
case "always":
return AlwaysAgentPullBehavior
...
}
}
OK
agent/config/config_test.go
Outdated
@@ -369,6 +371,14 @@ func TestImageCleanupMinimumNumImagesToDeletePerCycle(t *testing.T) {
assert.Equal(t, cfg.NumImagesToDeletePerCycle, DefaultNumImagesToDeletePerCycle, "Wrong value for NumImagesToDeletePerCycle")
}

func TestInvalidAgentPullBehavior(t *testing.T) {

Can you add a test for just `getAgentPullBehavior()` as well? Make it a table-driven test so that you can check that all 4 valid values and an invalid value for `ECS_AGENT_PULL_BEHAVIOR` return the expected output.
OK
agent/engine/docker_task_engine.go
Outdated
@@ -741,16 +776,23 @@ func (engine *DockerTaskEngine) pullAndUpdateContainerReference(task *api.Task,
if container.IsInternal() {
return metadata
}
PullSucceeded := metadata.Error == nil

Why is this `PullSucceeded` and not `pullSucceeded`?
Yes, it should be `pullSucceeded`.
agent/engine/docker_task_engine.go
Outdated
// (the image can be prepopulated with the AMI and never be pulled).
imageState := engine.imageManager.GetImageStateFromImageName(container.Image)
if imageState != nil && imageState.PullSucceeded {
seelog.Infof("Task engine [%s]: image %s has been pulled, use it directly for container %s",

I'd suggest changing this slightly to: `image %s has been pulled once, not pulling it again`
OK
agent/engine/docker_task_engine.go
Outdated
taskArn string) bool {
// If the agent pull behavior is never, we don't try to pull the image,
// try to use local image cache instead.
if agentPullBehavior == config.NeverAgentPullBehavior {

Minor: this would be more readable with a `switch` and cases.
OK
agent/engine/docker_task_engine.go
Outdated
// otherwise pull the image as usual, regardless whether the image exists or not
// (the image can be prepopulated with the AMI and never be pulled).
imageState := engine.imageManager.GetImageStateFromImageName(container.Image)
if imageState != nil && imageState.PullSucceeded {

I think all references to `PullSucceeded` need to be protected with a lock. Please add setters and getters to the image state struct for that and use those methods everywhere.
OK
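Lock-protected accessors of the kind requested above can be sketched like this. The struct here is a simplified, hypothetical stand-in for the agent's image state; `GetPullSucceeded` appears later in this PR, while `SetPullSucceeded` and the field layout are assumptions made for the example.

```go
package main

import (
	"fmt"
	"sync"
)

// imageState is a simplified stand-in for the agent's image state struct.
type imageState struct {
	lock          sync.RWMutex
	pullSucceeded bool
}

// GetPullSucceeded takes the read lock so concurrent readers do not block
// each other.
func (s *imageState) GetPullSucceeded() bool {
	s.lock.RLock()
	defer s.lock.RUnlock()
	return s.pullSucceeded
}

// SetPullSucceeded takes the write lock so the update is not interleaved
// with any reader or other writer.
func (s *imageState) SetPullSucceeded(v bool) {
	s.lock.Lock()
	defer s.lock.Unlock()
	s.pullSucceeded = v
}

func main() {
	state := &imageState{}
	var wg sync.WaitGroup
	// Several goroutines record a successful pull at the same time.
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			state.SetPullSucceeded(true)
		}()
	}
	wg.Wait()
	fmt.Println(state.GetPullSucceeded())
}
```

Keeping the field unexported and routing every access through the two methods means the race detector has nothing to flag, whichever goroutine touches the state.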
agent/engine/task_manager.go
Outdated
// where there is no local image cache will be handled when creating container.
if mtask.cfg.AgentPullBehavior == config.AlwaysAgentPullBehavior ||
mtask.cfg.AgentPullBehavior == config.OnceAgentPullBehavior {
seelog.Criticalf("Managed task [%s]: Error while pulling container %s, task will fail: %v",

You should probably log the container image here as well.
OK
agent/config/config.go
Outdated
@@ -455,6 +476,24 @@ func getContainerStartTimeout() time.Duration {
return containerStartTimeout
}

func getAgentPullBehavior() AgentPullBehaviorType {

You should rebase on top of my PR #1353 once it's merged so that this gets moved to `parse.go`.
OK
agent/config/config.go
Outdated
const (
// DefaultAgentPullBehavior specifies the behavior that if an image pull API call fails,
// agent tries to start from the Docker image cache anyway, assuming that the image has not changed.
DefaultAgentPullBehavior AgentPullBehaviorType = iota

`ImagePullDefault`, `ImagePullAlways`, `ImagePullNever`, `ImagePullOnce` seem better than this; can you change the names to be more readable? Also, can you move this into `types.go`?
Yeah, I can change the names, but I think they are more like constants such as `DefaultClusterName` and `DefaultTaskCleanupWaitDuration`, so they should be kept together with the other constants defined in config.go.
agent/config/config.go
Outdated
func getAgentPullBehavior() AgentPullBehaviorType {
var agentPullBehavior AgentPullBehaviorType
agentPullBehaviorString := os.Getenv("ECS_AGENT_PULL_BEHAVIOR")
switch agentPullBehaviorString {

You may need to normalize the string a little before comparing, e.g. trim whitespace and convert all characters to lowercase.
OK, good idea!
Hi I think this has already been done by func (cfg *Config) trimWhitespace()
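For reference, the normalization discussed in this thread is small enough to sketch with the standard library alone. The helper name `normalize` is hypothetical, and this is not the agent's `trimWhitespace` method, whose exact behavior is not shown in this thread.

```go
package main

import (
	"fmt"
	"strings"
)

// normalize strips surrounding whitespace and lowercases the value so that
// inputs such as " Always\n" and "always" compare equal.
func normalize(raw string) string {
	return strings.ToLower(strings.TrimSpace(raw))
}

func main() {
	fmt.Printf("%q\n", normalize("  Always \n"))
}
```

Whether lowercasing is desirable is a design choice: it is more forgiving of operator input, at the cost of accepting spellings that the documentation does not list.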
agent/engine/docker_image_manager.go
Outdated
@@ -154,6 +154,9 @@ func (imageManager *dockerImageManager) addContainerReferenceToNewImageState(con
Image: sourceImage,
PulledAt: time.Now(),
LastUsedAt: time.Now(),
// The PullSucceeded field is false by default,
// one has to explicitly set it to be true when the pull image succeeds.
PullSucceeded: false,

Please help me understand: `addContainerReferenceToNewImageState` is called after the image has been pulled successfully, so why is `PullSucceeded` set to false? What would happen if it were set to true here?
This is because the container reference is still updated even if the image pull fails, so we set it to false here. We then check whether the image pull returned an error, and set it to true if it did not.
agent/engine/docker_task_engine.go
Outdated
seelog.Infof("Task engine [%s]: use cached image directly for container %s",
taskArn, container.Name)
// It's not an internal container, so it's safe to add container reference.
engine.updateContainerReference(false, container, taskArn)

In this case, do we even need to track the image state? If it's not pulled by the agent, we shouldn't remove it.
@richardpen and I discussed a problem we may face with image cleanup. If we use the `never` or `once` agent behavior, the image cleanup process could remove the image, and I think for `never` and `once` we should keep the image cache, right?
Here is what we think we can do:
- Let customers know that there may be conflicts between image cleanup and the agent pull behavior setting, and suggest that when using `never` or `once` they set `NumImagesToDeletePerCycle` to 0 (or use some other image cleanup setting that prevents the image from being removed).
- Disable image cleanup internally when using `never` or `once`, and document that so that users know.
@sharanyad @richardpen @aaithal what do you think about this?
agent/engine/task_manager.go
Outdated
mtask.Arn, container.Name, event.Error)
// The task should be stopped regardless of whether this container is
// essential or non-essential.
mtask.SetDesiredStatus(api.TaskStopped)

Why are you emitting the `STOPPED` event? The task may not be `STOPPED` at this point.
Yeah, not notifying ACS right now is better.
agent/engine/docker_task_engine.go
Outdated
// useLocalImageCache returns true if local image cache should be used, or return false to continue
// pulling image, by inspecting the agent pull behavior variable defined in config. The caller has
// to make sure the container passed in is not an internal container.
func (engine *DockerTaskEngine) useLocalImageCache(agentPullBehavior config.AgentPullBehaviorType,

Renaming this to `ShouldUseLocalImage` might be more readable.
OK
agent/engine/docker_task_engine.go
Outdated
// (the image can be prepopulated with the AMI and never be pulled).
imageState := engine.imageManager.GetImageStateFromImageName(container.Image)
if imageState != nil && imageState.PullSucceeded {
seelog.Infof("Task engine [%s]: image %s has been pulled, use it directly for container %s",

Shouldn't we call `updateContainerReference` in this case?
Yeah, you are right, we should.
agent/engine/docker_task_engine.go
Outdated
@@ -741,16 +776,23 @@ func (engine *DockerTaskEngine) pullAndUpdateContainerReference(task *api.Task,
if container.IsInternal() {
return metadata
}
PullSucceeded := metadata.Error == nil
engine.updateContainerReference(PullSucceeded, container, task.Arn)

Can this logic be handled here -
if err != nil {
i.e., depending on `InspectImage`, set the `PullSucceeded` value?
I think inspecting an image and pulling an image are two different things. When you inspect an image, it can be there because it was preconfigured with the AMI rather than pulled from the user's ECR; this matters when the setting is `once`.
assert.Equal(t, dockerapi.DockerContainerMetadata{}, metadata, "expected empty metadata")
}

func TestPullImageWithOnceAgentPullBehavior(t *testing.T) {

Can you also add a test where the config is `once` and `PullSucceeded` is false?
OK
2d212de to 7e52650
Update: addressed the comments. Introduced a new behavior, "prefer_cached". Will update the README, add comments and tests for the new behavior, and remove the "never" behavior due to a change in requirements.
7e52650 to 8af18ee
Update: updated the README, added comments and tests for the "prefer_cached" behavior, and removed the "never" behavior due to a change in requirements. Also rebased.
}

func TestParseImagePullBehavior(t *testing.T) {
testcases := []struct {

nice!
Thanks!
@@ -117,6 +117,22 @@ func parseNumImagesToDeletePerCycle() int {
return numImagesToDeletePerCycle
}

func parseImagePullBehavior() ImagePullBehaviorType {
ImagePullBehaviorString := os.Getenv("ECS_IMAGE_PULL_BEHAVIOR")

Should we .ToLowerCase() the string?
I think we use `trimWhitespace` to format every env var string, and we should not do extra work outside of this method:
amazon-ecs-agent/agent/config/config.go
Line 187 in 159ae5c
func (cfg *Config) trimWhitespace() {
@@ -20,6 +20,10 @@ import (
cnitypes "github.com/containernetworking/cni/pkg/types"
)

// ImagePullBehaviorType is an enum variable type corresponding to different agent pull
// behaviors including default, always, never and once.
type ImagePullBehaviorType int8

Why an int8 over an int?
Because it's just an enum variable, which doesn't require a large range. There is usage of `int8` elsewhere in the agent:
type updateStage int8
agent/engine/docker_image_manager.go
Outdated
// If the image pull behavior is prefer cached, don't clean up the image,
// because the cached image is needed.
if imageManager.imagePullBehavior != config.ImagePullPreferCachedBehavior {
// passing the cleanup interval as argument which would help during testing

[do this!]
Log that you are disabling this.
[nit]
I think I'd prefer a positive comparison instead of !=.
if imageManager.imagePullBehavior == config.ImagePullPreferCachedBehavior {
seelog.Info("Pull behavior is set to always use cache. Disabling cleanup")
return
}
imageManager.performPeriodicImageCleanup(ctx, imageManager.imageCleanupTimeInterval)
OK, sure.
agent/engine/docker_task_engine.go
Outdated
case config.ImagePullPreferCachedBehavior:
// If the behavior is prefer_cached, don't pull if we found cached image
// by inspecting the image.
_, err := engine.client.InspectImage(container.Image)

Does this need a timeout?
No, the Docker client API does not need a timeout for inspecting an image.
Can you please remove this from the 1.17.4 milestone? We can sync offline about the reasons to do so.
agent/config/parse.go
Outdated
return ImagePullAlwaysBehavior
case "once":
return ImagePullOnceBehavior
case "prefer_cached":

Please change this to `preferCached`.
I found `Example values: ["awslogs","fluentd","gelf","json-file","journald","splunk"]` and `Example values: {"custom attribute": "custom_attribute_value"}` for a variable with multiple words in the official docs, but I did not find a camelCase one. I will double-check with Joel about this.
Those are examples. These are keywords that we're defining. All of our APIs are camelCased and don't have underscores.
For this, I don't think the APIs matter, because this is a value rather than a parameter. My opinion is to use "prefer-cached".
agent/engine/docker_task_engine.go
Outdated
func (engine *DockerTaskEngine) shouldUseLocalImage(ImagePullBehavior config.ImagePullBehaviorType,
container *api.Container,
taskArn string) bool {
switch ImagePullBehavior {

Function parameter names should never begin with uppercase letters. For a second, I got freaked out that you were comparing a const here. Please change that.
OK
agent/engine/docker_task_engine.go
Outdated
// otherwise pull the image as usual, regardless whether the image exists or not
// (the image can be prepopulated with the AMI and never be pulled).
imageState := engine.imageManager.GetImageStateFromImageName(container.Image)
if imageState != nil && imageState.GetPullSucceeded() {

A `nil` check for the existence of something in a map somewhere isn't a good paradigm. Please change the return type of `GetImageStateFromImageName` to `(ImageState, bool)` instead of just `*ImageState` and use the bool here.
OK.
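The "value plus bool" lookup requested above mirrors Go's comma-ok map idiom. Below is a minimal sketch, assuming simplified hypothetical types (`imageState`, `imageManager`, `getImageState`) rather than the agent's real ones. It keeps a pointer as the first return value, which differs from the literal `(ImageState, bool)` suggestion but avoids copying a struct that may hold a mutex.

```go
package main

import "fmt"

// imageState and imageManager are simplified, hypothetical stand-ins.
type imageState struct {
	image         string
	pullSucceeded bool
}

type imageManager struct {
	states map[string]*imageState
}

// getImageState reports existence through the second return value, so
// callers never have to interpret a nil pointer as "not found".
func (m *imageManager) getImageState(name string) (*imageState, bool) {
	state, ok := m.states[name]
	return state, ok
}

func main() {
	m := &imageManager{states: map[string]*imageState{
		"busybox:latest": {image: "busybox:latest", pullSucceeded: true},
	}}
	if state, ok := m.getImageState("busybox:latest"); ok && state.pullSucceeded {
		fmt.Println("already pulled:", state.image)
	}
	if _, ok := m.getImageState("nginx:latest"); !ok {
		fmt.Println("no state recorded for nginx:latest")
	}
}
```

The call site reads the same way as a plain map lookup, which makes the "missing entry" branch explicit instead of implied.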
agent/engine/docker_task_engine.go
Outdated
// shouldUseLocalImage returns true if local image cache should be used, or return false to continue
// pulling image, by inspecting the agent pull behavior variable defined in config. The caller has
// to make sure the container passed in is not an internal container.
func (engine *DockerTaskEngine) shouldUseLocalImage(ImagePullBehavior config.ImagePullBehaviorType,
The selective action that a caller needs to perform when calling this method is to pull the image if this returns a negative value. Essentially `if !shouldUseLocalImage() { // pull.. }`. It's much more intuitive if we operated in the reverse manner. Instead of `shouldUseLocalImage`, please make this `imagePullRequired` so that you can do `if imagePullRequired() { // pull.. }`.
Ok
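A rough sketch of the positively named predicate, assuming the four behaviors described in the README row of this PR. The boolean inputs `pulledBefore` and `cachedLocally` are hypothetical simplifications of what the engine would look up from the image manager and Docker; the real method takes a container and a task ARN.

```go
package main

import "fmt"

// ImagePullBehaviorType mirrors the enum discussed in this PR.
type ImagePullBehaviorType int

const (
	ImagePullDefaultBehavior ImagePullBehaviorType = iota
	ImagePullAlwaysBehavior
	ImagePullOnceBehavior
	ImagePullPreferCachedBehavior
)

// imagePullRequired returns true when the caller should pull the image.
// pulledBefore and cachedLocally are hypothetical inputs for illustration.
func imagePullRequired(behavior ImagePullBehaviorType, pulledBefore, cachedLocally bool) bool {
	switch behavior {
	case ImagePullOnceBehavior:
		// Pull only if this agent has not pulled the image successfully before.
		return !pulledBefore
	case ImagePullPreferCachedBehavior:
		// Pull only if no cached image is available on the instance.
		return !cachedLocally
	default:
		// default and always both attempt the pull; they differ only in
		// how a pull failure is handled afterwards.
		return true
	}
}

func main() {
	if imagePullRequired(ImagePullOnceBehavior, true, true) {
		fmt.Println("pulling image")
	} else {
		fmt.Println("using local image")
	}
}
```

The call site then reads `if imagePullRequired(...) { // pull }` with no negation.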
agent/engine/docker_task_engine.go
Outdated
}
imageState := engine.imageManager.GetImageStateFromImageName(container.Image)
if imageState != nil && pullSucceeded {
Here too, as per the above comment, please use the boolean returned from `GetImageStateFromImageName` to do this.
OK
agent/engine/image/types.go
Outdated
// PullSucceeded defines whether this image has been pulled successfully before,
// this should be set to true when one of the pull image call succeeds.
PullSucceeded bool
updateLock sync.RWMutex
Please rename this to just `lock`.
ok
const (
// ImagePullDefaultBehavior specifies the behavior that if an image pull API call fails,
// agent tries to start from the Docker image cache anyway, assuming that the image has not changed.
ImagePullDefaultBehavior ImagePullBehaviorType = iota
This can be just `ImagePullDefault`.
I think `ImagePullDefault` is more general: it describes the `ImagePull` but does not specify which aspect it's describing. I wouldn't know if it's the ImagePull default timeout, or the ImagePull default behavior, etc. So `ImagePullDefaultBehavior` makes more sense to me. Let me know if you have more comments on this.
`ImagePullDefault` would be of the type `ImagePullBehaviorType`, which would convey the meaning. The `Behavior` suffix can be removed. Also the type can be just `ImagePullType`, something like:
ImagePullDefault ImagePullType = iota
ImagePullAlways
and the environment variable can be `ECS_IMAGE_PULL_TYPE`. But I would suggest consulting Joel for this.
Yeah, my concern is this: if I name it `ImagePullDefault`, although it's of `ImagePullBehaviorType`, some day in the future we might have something like `ImagePullTimeoutType` (just for example) as an enum type that also has a default value. To be consistent, that value should also be named `ImagePullDefault`, which conflicts with what we have now. Does this make sense to you?
// ImagePullAlwaysBehavior specifies the behavior that if an image pull API call fails,
// the task fails instead of using cached image.
ImagePullAlwaysBehavior
This can be just `ImagePullAlways`.
See comment above.
// ImagePullOnceBehavior specifies the behavior that agent will only attempt to pull
// the same image once, once an image is pulled, local image cache will be used
// for all the containers.
ImagePullOnceBehavior
This can be just `ImagePullOnce`.
See comment above.
// ImagePullPreferCachedBehavior specifies the behavior that agent will only attempt to pull
// the image if there is no cached image.
ImagePullPreferCachedBehavior
This can be just `ImagePullPreferCached`.
See comment above.
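To make the naming discussion concrete, here is a small sketch of the `iota` enum together with a parser for the environment variable values listed in the README row (`default`, `always`, `once`, `prefer-cached`). The helper `parseImagePullBehavior` is hypothetical, and the fallback to the default behavior for unset or unrecognized values is an assumption, not something this thread confirms.

```go
package main

import (
	"fmt"
	"os"
)

// ImagePullBehaviorType enumerates the pull behaviors discussed in this thread.
type ImagePullBehaviorType int

const (
	ImagePullDefaultBehavior ImagePullBehaviorType = iota
	ImagePullAlwaysBehavior
	ImagePullOnceBehavior
	ImagePullPreferCachedBehavior
)

// parseImagePullBehavior is a hypothetical helper that maps the string value
// of the environment variable to the typed constant.
func parseImagePullBehavior(value string) ImagePullBehaviorType {
	switch value {
	case "always":
		return ImagePullAlwaysBehavior
	case "once":
		return ImagePullOnceBehavior
	case "prefer-cached":
		return ImagePullPreferCachedBehavior
	default:
		// Assumption: unset or unrecognized values fall back to the default behavior.
		return ImagePullDefaultBehavior
	}
}

func main() {
	behavior := parseImagePullBehavior(os.Getenv("ECS_IMAGE_PULL_BEHAVIOR"))
	fmt.Println("using default behavior:", behavior == ImagePullDefaultBehavior)
}
```

Because every constant carries the `ImagePullBehaviorType` type, a comparison such as `cfg.ImagePullBehavior == ImagePullAlwaysBehavior` is checked by the compiler whichever naming variant is chosen.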
agent/engine/docker_image_manager.go
Outdated
@@ -155,6 +156,9 @@ func (imageManager *dockerImageManager) addContainerReferenceToNewImageState(con
Image: sourceImage,
PulledAt: time.Now(),
LastUsedAt: time.Now(),
// The PullSucceeded filed is false by default,
nit: field
OK
agent/engine/docker_image_manager.go
Outdated
@@ -155,6 +156,9 @@ func (imageManager *dockerImageManager) addContainerReferenceToNewImageState(con
Image: sourceImage,
PulledAt: time.Now(),
LastUsedAt: time.Now(),
// The PullSucceeded filed is false by default,
// one has to explicitly set it to be true when the pull image succeeds.
PullSucceeded: false,
bool is false by default. We need not explicitly set it here.
OK
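The point about the default can be illustrated with Go's zero values: any field omitted from a composite literal is set to its type's zero value, which is `false` for `bool`. The struct below is a trimmed, hypothetical version of the image state, and `newImageState` is an illustrative constructor name.

```go
package main

import "fmt"

// ImageState is a trimmed, hypothetical version of the struct under review.
type ImageState struct {
	Image         string
	PullSucceeded bool
}

// newImageState omits PullSucceeded; the field takes bool's zero value, false.
func newImageState(image string) *ImageState {
	return &ImageState{Image: image}
}

func main() {
	state := newImageState("busybox:latest")
	fmt.Println(state.PullSucceeded)
}
```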
agent/engine/task_manager.go
Outdated
// don't want to use cached image for both cases.
if mtask.cfg.ImagePullBehavior == config.ImagePullAlwaysBehavior ||
mtask.cfg.ImagePullBehavior == config.ImagePullOnceBehavior {
seelog.Criticalf("Managed task [%s]: Error while pulling container %s and image %s, task will fail: %v",
This can be `Errorf`, and the message can be `Managed task [%s]: Error while pulling image %s for container %s , moving task to STOPPED:`
OK
8af18ee to d67deff
Update: rebased and refactored to address the comments. I am currently not sure when this feature will be released, so I am keeping it under 1.17.4 in the changelog and will update the changelog before merging.
README.md
Outdated
@@ -169,6 +169,7 @@ additional details on each available environment variable.
| `ECS_IMAGE_CLEANUP_INTERVAL` | 30m | The time interval between automated image cleanup cycles. If set to less than 10 minutes, the value is ignored. | 30m | 30m |
| `ECS_IMAGE_MINIMUM_CLEANUP_AGE` | 30m | The minimum time interval between when an image is pulled and when it can be considered for automated image cleanup. | 1h | 1h |
| `ECS_NUM_IMAGES_DELETE_PER_CYCLE` | 5 | The maximum number of images to delete in a single automated image cleanup cycle. If set to less than 1, the value is ignored. | 5 | 5 |
| `ECS_IMAGE_PULL_BEHAVIOR` | <default | always | once | prefer-cached > | The behavior used to customize the pull image process. If `default` is specified, the image will be pulled remotely, if the pull fails then the cached image will be used. If `always` is specified, the image will be pulled remotely, if the pull fails then the task will fail. If `once` is specified, the image will be pulled remotely if it has not been pulled before, otherwise the cached image will be used. If `prefer-cached` is specified, the image will be pulled remotely if there is no cached image, otherwise the cached image will be used. | default | default |
Is this reviewed by @nrdlngr? Also, for `once` it should be: the image will be pulled remotely if it has not been pulled or it has been removed by image cleanup.
The README and env vars are reviewed by @joelbrandenburg. I think the image cleanup related details should be reflected in the tech doc; I will let @joelbrandenburg know about this.
It should also be reflected here, otherwise the description here is not correct and could cause confusion.
// If the image pull behavior is prefer cached, don't clean up the image,
// because the cached image is needed.
if imageManager.imagePullBehavior == config.ImagePullPreferCachedBehavior {
seelog.Info("Pull behavior is set to always use cache. Disabling cleanup")
Can we point this out in the readme file?
I already told @joelbrandenburg about this, this should be also included in tech doc I think.
I think it might be valuable to add a link to the docs in the `README`.
I will ask @joelbrandenburg if it's ok to have a link in readme. I will attach the link in readme if he has the link ready before I merge, otherwise I will attach it some time after I merge.
} else {
t.Logf("Found image state for %s", test1Image2Name)
t.Fatalf("Could not find image state for %s", test1Image1Name)
Since you are here, can you change these to use `assert` or `require`?
ok
return engine.serialPull(task, container)
if engine.imagePullRequired(engine.cfg.ImagePullBehavior, container, task.Arn) {
// Record the pullStoppedAt timestamp
defer func() {
In this case what's the value of task PullStartedAt and pullStoppedAt?
I didn't quite get the question. The logic guarded by `imagePullRequired` is to pull the image instead of using the local cache, hence recording `PullStartedAt` and `pullStoppedAt` is necessary. (Previously it was `shouldUseLocalImage`; this was changed to `imagePullRequired` as per @aaithal's comments.)
So if `engine.imagePullRequired` is false, then the image pullStartedAt and pullStoppedAt won't be set, right? Can you run a task in this scenario and check, from the describe task API, that pullStoppedAt and pullStartedAt aren't set to something unexpected?
Yeah, I see what you mean. I saw the comment here:
amazon-ecs-agent/agent/api/task.go, line 112 in 159ae5c:
// it won't be set if the pull never happens
I also tested the describe task API (which you have seen), and it works as expected: the pull started at and pull stopped at will not show up.
"the pull started at and pull stopped at will now show up."
If the image is not pulled, these values shouldn't show up, right?
Yeah, they do not show up; I typed the wrong word.
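The behavior confirmed in this exchange can be sketched as follows: the pull timestamps are recorded only on the code path that actually pulls, so they keep their zero value when the cached image is used. The `task` struct, the `pullContainer` function, and the `pull` callback are hypothetical simplifications of the engine code, not its real signatures.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// task is a hypothetical stand-in holding only the two pull timestamps.
type task struct {
	pullStartedAt time.Time
	pullStoppedAt time.Time
}

// pullContainer records timestamps only when a pull is actually attempted.
func pullContainer(t *task, pullRequired bool, pull func() error) error {
	if !pullRequired {
		// Cached image is used: both timestamps stay at their zero value.
		return nil
	}
	t.pullStartedAt = time.Now()
	// Record the stop time whether the pull succeeds or fails.
	defer func() { t.pullStoppedAt = time.Now() }()
	return pull()
}

func main() {
	cached := &task{}
	_ = pullContainer(cached, false, func() error { return nil })
	fmt.Println("cached path, timestamps unset:", cached.pullStartedAt.IsZero() && cached.pullStoppedAt.IsZero())

	pulled := &task{}
	err := pullContainer(pulled, true, func() error { return errors.New("registry unreachable") })
	fmt.Println("pull path, error:", err, "stop time recorded:", !pulled.pullStoppedAt.IsZero())
}
```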
agent/engine/image/types.go
Outdated
// GetPullSucceeded safely returns the PullSucceeded of the imageState
func (imageState *ImageState) GetPullSucceeded() bool {
imageState.lock.Lock()
This should be a read lock.
OK
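A short sketch of the getter with a read lock, as requested above. `sync.RWMutex.RLock` lets multiple readers proceed concurrently, while `Lock` is reserved for writers. The `SetPullSucceeded` setter is assumed here for illustration and is not shown in the diff context above.

```go
package main

import (
	"fmt"
	"sync"
)

// ImageState is a trimmed, hypothetical version of the struct under review.
type ImageState struct {
	PullSucceeded bool
	lock          sync.RWMutex
}

// GetPullSucceeded takes the read lock, so concurrent readers do not block each other.
func (imageState *ImageState) GetPullSucceeded() bool {
	imageState.lock.RLock()
	defer imageState.lock.RUnlock()
	return imageState.PullSucceeded
}

// SetPullSucceeded (assumed setter) takes the exclusive write lock.
func (imageState *ImageState) SetPullSucceeded(pullSucceeded bool) {
	imageState.lock.Lock()
	defer imageState.lock.Unlock()
	imageState.PullSucceeded = pullSucceeded
}

func main() {
	state := &ImageState{}
	state.SetPullSucceeded(true)

	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = state.GetPullSucceeded() // readers share the lock
		}()
	}
	wg.Wait()
	fmt.Println(state.GetPullSucceeded())
}
```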
d67deff to c7090f8
Update: addressed @richardpen's comments and squashed the commits.
almost there!
README.md
Outdated
@@ -169,6 +169,7 @@ additional details on each available environment variable.
| `ECS_IMAGE_CLEANUP_INTERVAL` | 30m | The time interval between automated image cleanup cycles. If set to less than 10 minutes, the value is ignored. | 30m | 30m |
| `ECS_IMAGE_MINIMUM_CLEANUP_AGE` | 30m | The minimum time interval between when an image is pulled and when it can be considered for automated image cleanup. | 1h | 1h |
| `ECS_NUM_IMAGES_DELETE_PER_CYCLE` | 5 | The maximum number of images to delete in a single automated image cleanup cycle. If set to less than 1, the value is ignored. | 5 | 5 |
| `ECS_IMAGE_PULL_BEHAVIOR` | <default | always | once | prefer-cached > | The behavior used to customize the pull image process. If `default` is specified, the image will be pulled remotely, if the pull fails then the cached image will be used. If `always` is specified, the image will be pulled remotely, if the pull fails then the task will fail. If `once` is specified, the image will be pulled remotely if it has not been pulled before, otherwise the cached image will be used. If `prefer-cached` is specified, the image will be pulled remotely if there is no cached image, otherwise the cached image will be used. | default | default |
It would be good to specify that the cached image in the instance will be used.
OK
agent/engine/task_manager.go
Outdated
// don't want to use cached image for both cases.
if mtask.cfg.ImagePullBehavior == config.ImagePullAlwaysBehavior ||
mtask.cfg.ImagePullBehavior == config.ImagePullOnceBehavior {
seelog.Errorf("Managed task [%s]: Error while pulling image %s for container %s , moving task to STOPPED: %v",
nit: `Managed task [%s]: error while`, same for below.
OK
CHANGELOG.md
Outdated
@@ -1,6 +1,7 @@
# Changelog

## 1.17.4-dev
* Feature - Configurable agent pull behavior [#1348](https://github.com/aws/amazon-ecs-agent/pull/1348)
Could you please change this to `Configurable container image pull behavior`?
OK
Minor comments, otherwise lgtm. Also please make sure all these behaviors are manually tested.
c7090f8 to 8b4cb17
Update: addressed @richardpen's and @sharanyad's comments.
agent/engine/docker_image_manager.go
Outdated
@@ -155,6 +156,8 @@ func (imageManager *dockerImageManager) addContainerReferenceToNewImageState(con
Image: sourceImage,
PulledAt: time.Now(),
LastUsedAt: time.Now(),
// PullSucceeded should be set to true when the pull image succeeds.
PullSucceeded: false,
A bool value is `false` by default. This can be removed.
agent/engine/task_manager.go
Outdated
mtask.SetDesiredStatus(api.TaskStopped)
return false
}
// If the agent pull behavior is prefer_cached, we receive the error because
nit: prefer-cached
8b4cb17 to a8a9b98
Summary
This PR is to fix issue #413. It introduces a new environment variable `ECS_AGENT_PULL_BEHAVIOR` to allow users to define different pull image behaviors. Please see the design doc in the comments of issue #413, and the usage of `ECS_AGENT_PULL_BEHAVIOR` in README.md.
Implementation details
- `PullSucceeded` in `ImageState`, it will be set to `true` when the pull succeeds.
- `prefer-cached` behavior, check if there is a cached image by inspecting the image; if inspecting the image fails, then pull the image, otherwise just update the container reference. Make the task fail when the pull fails.
- `never` behavior, do nothing but update the container reference when pulling the container.
- `once` behavior, check if the image has been pulled by inspecting the `IsPullSuccess` of the image state; do nothing when the image has been pulled, otherwise pull as usual. If the pull image API call fails, make the task fail.
Testing
- `make release`
- `go build -out amazon-ecs-agent.exe ./agent`
- `make test`: pass
- `go test -timeout=25s ./agent/...`: pass
- `make run-integ-tests`: pass
- `.\scripts\run-integ-tests.ps1`: pass
- `make run-functional-tests`: pass
- `.\scripts\run-functional-tests.ps1`: pass
New tests cover the changes: yes, unit tests, integ tests and some manual tests.
Description for the changelog
Feature - Introduce a new environment variable `ECS_AGENT_PULL_BEHAVIOR` to make agent pull behavior configurable
Licensing
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.