Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pull retries and configurable routes for awsvpc network mode #975

Merged
merged 9 commits into from
Sep 13, 2017

Conversation

samuelkarp
Copy link
Contributor

@samuelkarp samuelkarp commented Sep 8, 2017

Summary

This series of commits adds retries for failed pulls (compensating for transient network issues) and allows additional routes to be configured for tasks launched with the awsvpc network mode.

Implementation details

Retries are implemented with the new utils.RetryNWithBackoffCtx (based on existing utils.RetryNWithBackoff) inside the dockerGoClient and excludes one expected non-transient error (missing IAM permissions to call ecr:GetAuthorizationToken). Other non-transient errors could be excluded from retries by extending CannotPullContainerError to implement utils.Retrier, but has not been done yet (I have not assembled a list of expected non-transient error codes from Docker).

Configurable routes were added by depending on the types.IPNet type from github.com/containernetworking/cni/pkg/types, which wraps net.IPNet so it can be serialized and deserialized in CIDR notation.

Testing

  • Builds on Linux (make release)
  • Builds on Windows (go build -out amazon-ecs-agent.exe ./agent)
  • Unit tests on Linux (make test) pass
  • Unit tests on Windows (go test -timeout=25s ./agent/...) pass
  • Integration tests on Linux (make run-integ-tests) pass
  • Integration tests on Windows (.\scripts\run-integ-tests.ps1) pass
  • Functional tests on Linux (make run-functional-tests) pass
  • Functional tests on Windows (.\scripts\run-functional-tests.ps1) pass

Ran in an environment with manually configured additional routes and with transient network failures.

New tests cover the changes: yes

Description for the changelog

  • Enhancement - Retry failed container image pull operations.

Licensing

This contribution is under the terms of the Apache 2.0 License: yes

This change enables a static IPv4 address to be assigned to tasks with
the `awsvpc` network mode and also configures a return route for traffic
between the task and the host.
@samuelkarp samuelkarp force-pushed the retries-and-configurable-routes branch from 21c19d9 to e45a20b Compare September 8, 2017 22:44
@@ -17,6 +17,7 @@
package engine

import (
context0 "context"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why was this added?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm using the context package in my code.

I have no idea why golang.org/x/net/context is still imported here...

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like the golang.org/x/net/context is used by the task_engine.Init(), I think that's why the reason it's imported. I asked because I saw two context imported, we can't get rid of this until go docker client updated this.

"github.com/pkg/errors"
)

// CNIClient defines the method of setting/cleaning up container namespace
type CNIClient interface {
Version(string) (string, error)
Capabilities(string) ([]string, error)
SetupNS(*Config) error
CleanupNS(*Config) error
SetupNS(*Config, []cnitypes.IPNet) error

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of passing an extra parameter, would it be better to add a new field in the Config struct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could do that if you feel strongly about it.


response := make(chan DockerContainerMetadata, 1)
go func() { response <- dg.pullImage(image, authData) }()
go func() {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we distinguish the errors generated by the agent from the errors generated by docker, and we only retry for errors from docker? For the errors generated by the agent like CannotPullECRContainerError and CreateEmptyVolumeError probably won't recover with retry.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CannotPullECRContainerError is already excluded from retries. I can extend CreateEmptyVolumeError to exclude it from retries as well. The other errors wrapped by CannotPullContainerError are not excluded right now; I only covered ECR because it was easy.

@@ -2,6 +2,7 @@

## UNRELEASED
* Feature - Support for provisioning Tasks with ENIs
* Enhancement - Retry failed container image pull operations.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also added another feature that add customized route in the container namespace for task using awsvpc. If I read the code correctly, this only can be configured by the config file, would it be more convenient to use environment variable. Also we may need to check the route before adding it, as the task traffic inside the container namespace can be changed by adding some route?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right; I had intended to make it configurable by environment variable as well but apparently never wrote the code. I'll do that.

I don't think we need to check the route. This is a sufficiently advanced, optional feature that is used in fairly narrow use-cases.

return nil
}
if imagePullError, ok := metadata.Error.(CannotPullContainerError); ok {
seelog.Warnf("When pulling %s: %s", container.Image, pullError)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pullError and imagePullError aren't supposed to be distinct names. Can we fix this commit before the PR is merged? Either name is fine to use.

(I realize that this code doesn't exist as written in the final PR, but I do want bisects, etc, to work and we shouldn't have unbuildable code in the repo.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'll fix this.

@@ -70,8 +70,8 @@ func New(p client.ConfigProvider, cfgs ...*aws.Config) *ECS {

// newClient creates, initializes and returns a new service client instance.
func newClient(cfg aws.Config, handlers request.Handlers, endpoint, signingRegion, signingName string) *ECS {
if signingName == "" {
signingName = ServiceName
if len(signingName) == 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this way?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, I didn't make this change...this is what happened when I ran make gogenerate.

But the logic here is actually a little better; even though the zero-value of a string is "", this also would handle a *string where the zero-value is nil with no changes.

}

func (engine *DockerTaskEngine) buildCNIConfigFromTaskContainer(task *api.Task, container *api.Container) (*ecscni.Config, error) {
cfg, err := task.BuildCNIConfig()
if err != nil {
return nil, errors.Wrapf(err, "engine: build cni configuration from taskfailed")
}

if engine.cfg.OverrideAWSVPCLocalIPv4Address != nil &&
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nit]
consider moving this expression to a smaller function. ie, hasValidAWSVPCLocalOverrides(cfg *Config)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'm going to leave this as-is for now unless you feel strongly about it.

select {
case <-ctx.Done():
return nil
default:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Neat!!

Why keep the default block empty? Would it make sense to put err = fn().. etc there?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seemed clearer to me to have it out. This way it reads that we early return if the context is done, but otherwise keep going.

@samuelkarp samuelkarp force-pushed the retries-and-configurable-routes branch from e45a20b to 557658d Compare September 13, 2017 00:34
@samuelkarp
Copy link
Contributor Author

@richardpen @nmeyerhans I've addressed your comments, can you take a look?

@samuelkarp samuelkarp force-pushed the retries-and-configurable-routes branch from 557658d to cf49171 Compare September 13, 2017 00:40
Copy link

@richardpen richardpen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also have you manually tested the route add part?

@@ -46,7 +46,7 @@ type NamedError interface {
ErrorName() string
}

// NamedError is a wrapper type for 'error' which adds an optional name and provides a symetric marshal/unmarshal
// NamedError is a wrapper type for 'error' which adds an optional name and provides a symmetric marshal/unmarshal

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this comment should be moved above to the NamedError.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll change it to talking about DefaultNamedError, since that's the implementation that actually provides the marshal/unmarshal behavior.

@@ -218,7 +219,8 @@ func fileConfig() (Config, error) {

err = json.Unmarshal(data, &cfg)
if err != nil {
seelog.Errorf("Error reading cfg json data, err %v", err)
seelog.Criticalf("Error reading cfg json data, err %v", err)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worth to logging the string(data) here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, not a bad idea.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After further thought, I'm going to leave this out for now. I'm concerned about the case where a customer adds credentials (EngineAuthData) to the file and then we have a parse error; that would cause credentials to be written out to the log.

if additionalLocalRoutesEnv != "" {
err := json.Unmarshal([]byte(additionalLocalRoutesEnv), &additionalLocalRoutes)
if err != nil {
seelog.Warnf("Invalid format for ECS_AWSVPC_ADDITIONAL_LOCAL_ROUTES, expected a json array of CIDRs: %v", err)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here, should we log the additionalLocalRoutesEnv?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, can do.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm also going to leave this out. We don't log the read environment variable for any of the other ones that we parse. I'd rather leave this consistent for now and then revisit it later when we can change all of them.

os.Setenv("ECS_ENABLE_TASK_ENI", "true")
defer os.Unsetenv("ECS_ENABLE_TASK_ENI")
additionalLocalRoutesJSON := `["1.2.3.4/22","5.6.7.8/32"]`
os.Setenv("ECS_AWSVPC_ADDITIONAL_LOCAL_ROUTES", additionalLocalRoutesJSON)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you missed defer here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops!

assert.Equal(t, map[string]string{"attribute1": "value1"}, cfg.InstanceAttributes)
}

// TestDockerAuthMergeFromFile tests docker auth read from file correctly after merge

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the comment is incorrect.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops!

@@ -25,6 +25,8 @@ import (
)

func TestConfigDefault(t *testing.T) {
os.Setenv("AWS_DEFAULT_REGION", "foo-bar-1")
defer os.Unsetenv("AWS_DEFAULT")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: AWS_DEFAULT_REGION

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops!

@@ -64,10 +66,12 @@ func TestConfigDefault(t *testing.T) {
}

func TestConfigIAMTaskRolesReserves80(t *testing.T) {
os.Setenv("AWS_DEFAULT_REGION", "foo-bar-1")
defer os.Unsetenv("AWS_DEFAULT")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops!

@samuelkarp
Copy link
Contributor Author

Also have you manually tested the route add part?

Yes, but in an earlier revision. I will re-run my tests with the latest changes.

README.md Outdated
| `ECS_ENABLE_TASK_ENI` | `false` | Whether to enable task networking for task to be launched with its own network interface | `false` | Not applicable |
| `ECS_CNI_PLUGINS_PATH` | `/ecs/cni` | The path where the cni binary file is located | `/amazon-ecs-cni-plugins` | Not applicable |
| `ECS_AWSVPC_BLOCK_IMDS` | `true` | Whether to block access to [Instance Metdata](http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-metadata.html) for Tasks started with `awsvpc` network mode | `false` | Not applicable |
| `ECS_AWSVPC_ADDITIONAL_LOCAL_ROUTES` | `["10.0.15.0/24"]` | Additional routes that cause traffic to travel by the default ENI | `[]` | Not applicable |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although the name strongly implies it, the description should be explicit that this is used in awsvpc mode.

It's also not entirely accurate to say that traffic to these routes goes out via the default ENI; it may never leave the host at all. I might rewrite the description as

In awsvpc network mode, traffic to these prefixes will be routed via the host bridge instead of the task ENI

or something like that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll use your description.

if additionalLocalRoutesEnv != "" {
err := json.Unmarshal([]byte(additionalLocalRoutesEnv), &additionalLocalRoutes)
if err != nil {
seelog.Warnf("Invalid format for ECS_AWSVPC_ADDITIONAL_LOCAL_ROUTES, expected a json array of CIDRs: %v", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is warn the right severity here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're using warn for all of them right now. I'd be happy to change all of them to error if you'd like.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, we're using warn for both non-fatal conditions (actually should be warnings) and for parsing errors (should be errors). I'll change the ones that actually cause the agent to exit to error.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, that's pretty much what I was getting at. If NewConfig() returns an error, agent panics. We should probably log the condition that cause the panic at an appropriately high severity.

@samuelkarp samuelkarp force-pushed the retries-and-configurable-routes branch from cf49171 to 2bc5d66 Compare September 13, 2017 17:57
@samuelkarp
Copy link
Contributor Author

@richardpen @nmeyerhans I believe I have addressed your comments, including additional manual testing. Can you take a look?

Copy link
Contributor

@aaithal aaithal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. I've got minor questions comments.

err = json.Unmarshal([]byte(instanceAttributesEnv), &instanceAttributes)
if instanceAttributesEnv != "" {
if err != nil {
err := fmt.Errorf("Invalid format for ECS_INSTANCE_ATTRIBUTES. Expected a json hash")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we log/wrap the original error as well?

@@ -26,7 +27,11 @@ func TestSetupNS(t *testing.T) {
mockResult.EXPECT().String().Return(""),
)

err := ecscniClient.SetupNS(&Config{})
additionalRoutesJson := `["169.254.172.1/32", "10.11.12.13/32"]`
additionalRoutes := &[]cnitypes.IPNet{}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this reads weird to me. Can it be just var AdditionalLocalRoutes []cnitypes.IPNet{} here instead and you can unmarshal it as &additionalRoutes?


retriableErr, isRetriableErr := err.(Retriable)

if err == nil || isRetriableErr && !retriableErr.Retry() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you use parenthesis here to illustrate/enforce the operator precedence? Otherwise, readers like me will always have to look up the operator precedence documentation for how this is evaluated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's just short-circuiting evaluation; any time a left-condition evaluates false, nothing proceeds.

Are you asking me to change this to err == nil || (isRetriableErr && !retriableErr.Retry())? That's logically equivalent, but (to me) harder to read.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you asking me to change this to err == nil || (isRetriableErr && !retriableErr.Retry())? That's logically equivalent, but (to me) harder to read.

Yeah, I prefer that. Because, then I wouldn't have to wonder if isRetriableErr is associated with the || or && operator.


// OverrideAWSVPCLocalIPv4Address overrides the local IPv4 address chosen
// for a task using the `awsvpc` networking mode. Using this configuration
// will limit you to running one `awsvpc` task at a time. IPv4 addresses
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using this configuration will limit you to running one awsvpc task

Since we don't enforce this anywhere in the code (iiuc), may be we should add a TODO so that we don't lose sight of this? Even an issue would be fine.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want a TODO even if we don't have any short-term plans to change it? If you statically-assign an IP address, you'll always be limited to one or else you'll have IP address conflicts and things won't work right.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, may be the better thing to do here is to update the description for the ECS_AWSVPC_ADDITIONAL_LOCAL_ROUTES field in the README file with some of what you have here as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ECS_AWSVPC_ADDITIONAL_LOCAL_ROUTES won't limit you to running one awsvpc task; it just adds routes. It's specifically the static IP assignment that is limiting here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, OverrideAWSVPCLocalIPv4Address gets set just from the JSON config and nowhere else? If that's the case, it's probably fine to leave it be. I'm just looking for ways to highlight this behavior so that it doesn't lead to post start-task failures.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exactly. The intent here is that this is meant to serve a very limited set of use-cases and is not really a generally useful feature, so it's excluded from the more-common way to configure the agent (environment variables).

Copy link
Contributor

@nmeyerhans nmeyerhans left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM :shipit:

We're going to need an IPv6 equivalent of ECS_AWSVPC_ADDITIONAL_LOCAL_ROUTES at some point...

@samuelkarp samuelkarp force-pushed the retries-and-configurable-routes branch from 2bc5d66 to 4ebfefe Compare September 13, 2017 18:43
@samuelkarp
Copy link
Contributor Author

@aaithal I've addressed your comments, except for the one about OverrideAWSVPCLocalIPv4Address. Let me know if you want me to make any additional changes.

@samuelkarp samuelkarp merged commit 4ebfefe into aws:dev Sep 13, 2017
@jhaynes jhaynes mentioned this pull request Sep 30, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants