-
Notifications
You must be signed in to change notification settings - Fork 619
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
eni watcher: add backoff-retry #1148
Conversation
This commit adds back-off retry logic in: * udev watcher: It takes a few milliseconds for the ENI's mac address to show up on the instance once udev watcher detects that a new device has been attached * eni state lookup: If for whatever reason, there's a delay in ACS sending the ENI attachment message to the agent, the agent's state remains unaware of the ENI This should lead to improved task start latencies of awsvpc tasks.
1f3855f
to
64377a9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor comments.
if err != nil { | ||
return "", err | ||
} | ||
return link.Attrs().HardwareAddr.String(), err | ||
// RetryWithBackoffCtx returns nil when the context is cancelled. Check if there was |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be better to check retriever.macAddress != ""
, though I don't strongly opposite this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I don't know if it gives us any additional benefits. i like dealing with errors rather than empty strings or nils. So, will keep it this way
@@ -57,6 +88,17 @@ type UdevWatcher struct { | |||
primaryMAC string | |||
} | |||
|
|||
// unmanagedENIError is used to indicate that the agent found an ENI, but the agent isn't |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be moved to another file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking we may need a type.go
for the struct and const, to split this file into smaller pieces which are easier to read.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'm not a fan of the type.go
model. It's much more easier to navigate/organize the code where the struct and the methods are defined in the same file. We could move this into an errors.go
file. But, seeing how it's just one type, I'd rather keep it here than to create a new file for it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
code lgtm, just few questions.
) | ||
|
||
const ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how exactly did we determine the durations for these backoff times? same with the eni ones below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
with experiments. It took an average of 10ms for the ENI's mac address to show up on the instance. The one for attachment message was based on my conversations with the ECS backend team.
splitDevLink := strings.SplitN(devicePath, "devices/", 2) | ||
if len(splitDevLink) != 2 { | ||
log.Warnf("Cannot determine device validity: %s", devicePath) | ||
seelog.Warnf("Cannot determine device validity: %s", devicePath) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
its not clear to me why len(splitDevLink) != 2
means we cannot determine device validity. can we add some documentation? should we add the splitDevLink
to log output as well? although, maybe i'm just missing context.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The example entries in the comment above are meant to illustrate that:
eth1 -> /devices/pci0000:00/0000:00:05.0/net/eth1
This means that we expect it to be of "devices/some-string" format. Is that not clear enough? It's just additional validation
|
||
// Error returns the error string for the unmanagedENIError type | ||
func (err *unmanagedENIError) Error() string { | ||
return fmt.Sprintf("udev watcher send ENI state change: eni not managed by ecs: %s", err.mac) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's not clear to me what we mean by udev watcher send ENI state change
. the second part of the error message is fine though.
@@ -1,4 +1,4 @@ | |||
// Copyright 2014-2015 Amazon.com, Inc. or its affiliates. All Rights Reserved. | |||
// Copyright 2014-2017 Amazon.com, Inc. or its affiliates. All Rights Reserved. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for documenting all these =]
dev := filepath.Base(retriever.dev) | ||
link, err := retriever.netlinkClient.LinkByName(dev) | ||
if err != nil { | ||
return utils.NewRetriableError(utils.NewRetriable(false), err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure that we should never retry in this case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, an error means a syscall failed for describing the device. at this point of time, it doesn't make sense to retry it as it really should be visible on the instance
@@ -4,6 +4,7 @@ | |||
* Feature - Support a HTTP endpoint for `awsvpc` tasks to query metadata | |||
* Bug - Fixed a bug where `-version` fails due to its dependency on docker client. [#1118](https://github.com/aws/amazon-ecs-agent/pull/1118) | |||
* Bug - Persist container exit code in agent state file [#1125](https://github.com/aws/amazon-ecs-agent/pull/1125) | |||
* Bug - Fixed a bug where the agent could miss sending an ENI attachment to ECS because of address propagation delays [#1148](https://github.com/aws/amazon-ecs-agent/pull/1148) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you wrap this at 80 characters?
|
||
// macAddressBackoffJitter specifies the jitter multiple percentage when | ||
// looking for an ENI's mac address on the host | ||
macAddressBackoffJitter = 0.2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think jitter is really necessary here since you're not making remote calls or accessing a resource with high concurrency. On the other hand, I don't think it really hurts either except to make the timing a bit more unpredictable.
if err != nil { | ||
return "", err | ||
} | ||
return link.Attrs().HardwareAddr.String(), err | ||
// RetryWithBackoffCtx returns nil when the context is cancelled. Check if there was | ||
// a timeout here. TODO: Fix RetryWithBackoffCtx to return ctx.Err() on context Done() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oops, I think that's on me...
closing this as this is already merged to the master branch with 1.16.1 release |
Summary
eni watcher: add backoff-retry
Implementation details
show up on the instance once udev watcher detects that a new device has
been attached
sending the ENI attachment message to the agent, the agent's state
remains unaware of the ENI
This should lead to improved task start latencies of awsvpc tasks.
Testing
make release
)go build -out amazon-ecs-agent.exe ./agent
)make test
) passgo test -timeout=25s ./agent/...
) passmake run-integ-tests
) pass.\scripts\run-integ-tests.ps1
) passmake run-functional-tests
) pass.\scripts\run-functional-tests.ps1
) passAttachTaskNetworkInterfacesMessage
andPayloadMessage
without this change: 23sAttachTaskNetworkInterfacesMessage
andPayloadMessage
with this change: 11sNew tests cover the changes: yes
Description for the changelog
Licensing
This contribution is under the terms of the Apache 2.0 License: yes