Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

eni watcher: add backoff-retry #1148

Closed
wants to merge 2 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,7 @@
* Feature - Support a HTTP endpoint for `awsvpc` tasks to query metadata
* Bug - Fixed a bug where `-version` fails due to its dependency on docker client. [#1118](https://github.com/aws/amazon-ecs-agent/pull/1118)
* Bug - Persist container exit code in agent state file [#1125](https://github.com/aws/amazon-ecs-agent/pull/1125)
* Bug - Fixed a bug where the agent could miss sending an ENI attachment to ECS because of address propagation delays [#1148](https://github.com/aws/amazon-ecs-agent/pull/1148)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you wrap this at 80 characters?


## 1.16.0
* Feature - Support pulling from Amazon ECR with specified IAM role in task definition
Expand Down
109 changes: 100 additions & 9 deletions agent/eni/networkutils/utils_linux.go
Original file line number Diff line number Diff line change
Expand Up @@ -17,22 +17,113 @@
package networkutils

import (
"context"
"path/filepath"
"strings"
"time"

"github.com/aws/amazon-ecs-agent/agent/eni/netlinkwrapper"
"github.com/aws/amazon-ecs-agent/agent/utils"
"github.com/pkg/errors"

log "github.com/cihub/seelog"
"github.com/cihub/seelog"
)

const (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how exactly did we determine the durations for these backoff times? same with the eni ones below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with experiments. It took an average of 10ms for the ENI's mac address to show up on the instance. The one for attachment message was based on my conversations with the ECS backend team.

// macAddressBackoffMin specifies the mimimum duration for the backoff
// when looking for an ENI's mac address on the host
macAddressBackoffMin = 2 * time.Millisecond

// macAddressBackoffMax specifies the maximum duration for the backoff
// when looking for an ENI's mac address on the host
macAddressBackoffMax = 200 * time.Millisecond

// macAddressBackoffJitter specifies the jitter multiple percentage when
// looking for an ENI's mac address on the host
macAddressBackoffJitter = 0.2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think jitter is really necessary here since you're not making remote calls or accessing a resource with high concurrency. On the other hand, I don't think it really hurts either except to make the timing a bit more unpredictable.


// macAddressBackoffMultiple specifies the backoff duration multiplier
// when looking for an ENI's mac address on the host
macAddressBackoffMultiple = 1.5
)

// macAddressRetriever is used to retrieve the mac address of a device. It collects
// all the information necessary to start this operation and stores the result in
// the 'macAddress' attribute
type macAddressRetriever struct {
dev string
netlinkClient netlinkwrapper.NetLink
macAddress string
// timeout specifies the timeout duration before giving up when
// looking for an ENI's mac address on the host
timeout time.Duration
ctx context.Context
}

// GetMACAddress retrieves the MAC address of a device using netlink
func GetMACAddress(dev string, netlinkClient netlinkwrapper.NetLink) (string, error) {
dev = filepath.Base(dev)
link, err := netlinkClient.LinkByName(dev)
func GetMACAddress(ctx context.Context,
timeout time.Duration,
dev string,
netlinkClient netlinkwrapper.NetLink) (string, error) {
retriever := &macAddressRetriever{
dev: dev,
netlinkClient: netlinkClient,
ctx: ctx,
timeout: timeout,
}
return retriever.retrieve()
}

// retrieve retrives the mac address of a network device. If the retrieved mac
// address is empty, it retries the operation with a timeout specified by the
// caller
func (retriever *macAddressRetriever) retrieve() (string, error) {
backoff := utils.NewSimpleBackoff(macAddressBackoffMin, macAddressBackoffMax,
macAddressBackoffJitter, macAddressBackoffMultiple)
ctx, cancel := context.WithTimeout(retriever.ctx, retriever.timeout)
defer cancel()

err := utils.RetryWithBackoffCtx(ctx, backoff, func() error {
retErr := retriever.retrieveOnce()
if retErr != nil {
seelog.Warnf("Unable to retrieve mac address for device '%s': %v",
retriever.dev, retErr)
return retErr
}

if retriever.macAddress == "" {
seelog.Debugf("Empty mac address for device '%s'", retriever.dev)
// Return a retriable error when mac address is empty. If the error
// is not wrapped with the RetriableError interface, RetryWithBackoffCtx
// treats them as retriable by default
return errors.Errorf("eni mac address: retrieved empty address for device %s",
retriever.dev)
}

return nil
})
if err != nil {
return "", err
}
return link.Attrs().HardwareAddr.String(), err
// RetryWithBackoffCtx returns nil when the context is cancelled. Check if there was

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be better to check retriever.macAddress != "", though I don't strongly opposite this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I don't know if it gives us any additional benefits. i like dealing with errors rather than empty strings or nils. So, will keep it this way

// a timeout here. TODO: Fix RetryWithBackoffCtx to return ctx.Err() on context Done()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops, I think that's on me...

if err = ctx.Err(); err != nil {
return "", errors.Wrapf(err, "eni mac address: timed out waiting for eni device '%s'",
retriever.dev)
}

return retriever.macAddress, nil
}

// retrieveOnce retrieves the MAC address of a device using netlink.LinkByName
func (retriever *macAddressRetriever) retrieveOnce() error {
dev := filepath.Base(retriever.dev)
link, err := retriever.netlinkClient.LinkByName(dev)
if err != nil {
return utils.NewRetriableError(utils.NewRetriable(false), err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure that we should never retry in this case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, an error means a syscall failed for describing the device. at this point of time, it doesn't make sense to retry it as it really should be visible on the instance

}
retriever.macAddress = link.Attrs().HardwareAddr.String()
return nil
}

// IsValidNetworkDevice is used to differentiate virtual and physical devices
Expand All @@ -43,23 +134,23 @@ func IsValidNetworkDevice(devicePath string) bool {
* eth1 -> /devices/pci0000:00/0000:00:05.0/net/eth1
* eth0 -> ../../devices/pci0000:00/0000:00:03.0/net/eth0
* lo -> ../../devices/virtual/net/lo
*/
*/
splitDevLink := strings.SplitN(devicePath, "devices/", 2)
if len(splitDevLink) != 2 {
log.Warnf("Cannot determine device validity: %s", devicePath)
seelog.Warnf("Cannot determine device validity: %s", devicePath)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

its not clear to me why len(splitDevLink) != 2 means we cannot determine device validity. can we add some documentation? should we add the splitDevLink to log output as well? although, maybe i'm just missing context.

Copy link
Contributor Author

@aaithal aaithal Dec 22, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The example entries in the comment above are meant to illustrate that:

eth1 -> /devices/pci0000:00/0000:00:05.0/net/eth1

This means that we expect it to be of "devices/some-string" format. Is that not clear enough? It's just additional validation

return false
}
/*
* CoreOS typically employs the vif style for physical net interfaces
* Amazon Linux, Ubuntu, RHEL, Fedora, Suse use the traditional pci convention
*/
*/
if strings.HasPrefix(splitDevLink[1], pciDevicePrefix) || strings.HasPrefix(splitDevLink[1], vifDevicePrefix) {
return true
}
if strings.HasPrefix(splitDevLink[1], virtualDevicePrefix) {
return false
}
// NOTE: Should never reach here
log.Criticalf("Failed to validate device path: %s", devicePath)
seelog.Criticalf("Failed to validate device path: %s", devicePath)
return false
}
54 changes: 52 additions & 2 deletions agent/eni/networkutils/utils_linux_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -16,9 +16,11 @@
package networkutils

import (
"context"
"errors"
"net"
"testing"
"time"

"github.com/golang/mock/gomock"
"github.com/stretchr/testify/assert"
Expand Down Expand Up @@ -49,7 +51,8 @@ func TestGetMACAddress(t *testing.T) {
Name: randomDevice,
},
}, nil)
mac, err := GetMACAddress(randomDevice, mockNetlink)
ctx := context.TODO()
mac, err := GetMACAddress(ctx, time.Millisecond, randomDevice, mockNetlink)
assert.Nil(t, err)
assert.Equal(t, mac, validMAC)
}
Expand All @@ -60,10 +63,57 @@ func TestGetMACAddressWithNetlinkError(t *testing.T) {
mockCtrl := gomock.NewController(t)
defer mockCtrl.Finish()
mockNetlink := mock_netlinkwrapper.NewMockNetLink(mockCtrl)
// LinkByName returns an error. This will ensure that a non-retriable
// error is generated and halts the backoff-rety loop
mockNetlink.EXPECT().LinkByName(randomDevice).Return(
&netlink.Device{},
errors.New("Dummy Netlink Error"))
mac, err := GetMACAddress(randomDevice, mockNetlink)
ctx := context.TODO()
mac, err := GetMACAddress(ctx, 2*macAddressBackoffMax, randomDevice, mockNetlink)
assert.Error(t, err)
assert.Empty(t, mac)
}

// TestGetMACAddressNotFoundRetry tests if netlink.LinkByName gets invoked
// multiple times if the mac address for a device is returned as empty string
func TestGetMACAddressNotFoundRetry(t *testing.T) {
mockCtrl := gomock.NewController(t)
defer mockCtrl.Finish()
mockNetlink := mock_netlinkwrapper.NewMockNetLink(mockCtrl)
pm, _ := net.ParseMAC(validMAC)
gomock.InOrder(
// Return empty mac address on first invocation
mockNetlink.EXPECT().LinkByName(randomDevice).Return(&netlink.Device{}, nil),
// Return a valid mac address on first invocation. Even though the first
// invocation did not result in an error, it did return an empty mac address.
// Hence, we expect it to be retried
mockNetlink.EXPECT().LinkByName(randomDevice).Return(&netlink.Device{
LinkAttrs: netlink.LinkAttrs{
HardwareAddr: pm,
Name: randomDevice,
},
}, nil),
)
ctx := context.TODO()
// Set max retry duration to twice that of the min backoff to ensure that there's
// enough time to retry
mac, err := GetMACAddress(ctx, 2*macAddressBackoffMin, randomDevice, mockNetlink)
assert.NoError(t, err)
assert.Equal(t, mac, validMAC)
}

// TestGetMACAddressNotFoundRetryExpires tests if the backoff-retry logic for
// retrieving mac address returns error on timeout
func TestGetMACAddressNotFoundRetryExpires(t *testing.T) {
mockCtrl := gomock.NewController(t)
defer mockCtrl.Finish()
mockNetlink := mock_netlinkwrapper.NewMockNetLink(mockCtrl)
// Return empty mac address everytime
mockNetlink.EXPECT().LinkByName(randomDevice).Return(&netlink.Device{}, nil).MinTimes(1)
ctx := context.TODO()
// Set max retry duration to twice that of the min backoff to ensure that there's
// enough time to retry
mac, err := GetMACAddress(ctx, 2*macAddressBackoffMin, randomDevice, mockNetlink)
assert.Error(t, err)
assert.Empty(t, mac)
}
Expand Down
Loading