-
Notifications
You must be signed in to change notification settings - Fork 619
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handle SetReadDeadline error for websocket connections (#1292) #1310
Conversation
4.9 reached EOL in 2016 and is not available on Dockerhub anymore. Update the base image to 6.4 to fix this.
We're at least 2 years behind in this repo. Updated gomock to latest commit.
[1] Forcefully terminate connection if there's any error with the SetReadDeadline() call. Not doing this leads to unpredictable behavior with stale websocket connections. Since SetWriteDeadline always returns `nil`, errors from that are not handled. [2] Move WebsocketConn to its own package. This allows us to mock it out in the "wsclient" package and write unit tests on it.
Invoke 'goimports' while generating sdk artifacts. Generated targets were missing some imports and this fixes that issue. Also, refactored the code a bit to reduce duplication.
Added changelog entry for stale websocket connection bug fix
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
code lgtm, just few questions.
} | ||
// An unhandled error has occurred while trying to extend read deadline. | ||
// Try asynchronously closing the connection. We don't want to be blocked on stale connections | ||
// taking too long to close. The flip side is that we might start accumulating stale connections. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mainly for my understanding - what exactly happens if we "start accumulating stale connections"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
at some point of time, the OS is going to stop the ECS agent from creating new connections if there's a chronic issue because of file descriptor limits.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do we know that? Do we have a test that ensures that the expected behavior happens when we hit the limit?
agent/wsclient/client.go
Outdated
seelog.Warnf("Unable to close websocket connection: %v for %s", | ||
closeErr, cs.URL) | ||
} | ||
return err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
again, mainly for my understanding. what exactly happens if we're "unable to close websocket connection" here? seems like we just move along. what side effects in the agent should we expect if we're unable to close lots of these websocket connections here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above. lmk if you have more questions.
@@ -429,6 +429,7 @@ func newDisconnectionTimer(client wsclient.ClientServer, timeout time.Duration, | |||
if err := client.Close(); err != nil { | |||
seelog.Warnf("Error disconnecting: %v", err) | |||
} | |||
seelog.Info("Disconnected from ACS") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I think copyright line in this file needs updating too.
@@ -11,7 +11,7 @@ | |||
# express or implied. See the License for the specific language governing | |||
# permissions and limitations under the License. | |||
|
|||
FROM gcc:4.9 | |||
FROM gcc:6.4 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we just use the latest here, in case we will need to update again in the future?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't want to use latest
as it makes it hard to audit and track changes
agent/wsclient/client.go
Outdated
}() | ||
ctx, cancel := context.WithTimeout(context.TODO(), wsConnectTimeout) | ||
defer cancel() | ||
for { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks like this doesn't need to be in the for loop, as both of the cases will return?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you're right. will modify that
agent/wsclient/client_test.go
Outdated
@@ -36,6 +40,11 @@ import ( | |||
|
|||
const dockerEndpoint = "/var/run/docker.sock" | |||
|
|||
// Close closes the underlying connection | |||
func (cs *ClientServerImpl) Close() error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this implemented in the test file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ClientServerImpl
is a partial implementation of the interface, which lacks the Close
method. If this is not there, there'll be NPEs. I can document that to make it more clear.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome, and good improvement to the test structure!
agent/wsclient/client.go
Outdated
}() | ||
ctx, cancel := context.WithTimeout(context.TODO(), wsConnectTimeout) | ||
defer cancel() | ||
for { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need a loop if this only runs once? Select blocks forever and both of the cases return.
seelog.Errorf("Stopping redundant reads on closed network connection: %s", cs.URL) | ||
return opErr | ||
} | ||
// An unhandled error has occurred while trying to extend read deadline. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two things stick out to me here. They're stylistic things, so FFTI:
- This doesn't follow the standard
err != nil
pattern where only good things happen on the left side. - In my opinion, all of the second half of this should be in a separate method from
SetReadDeadline
.
Lets say you extracted the server close logic into closeCSAsync
. I'd maybe go with a structure like this:
err := cs.Conn.SetReadDeadline()
if err != nil {
// check for opErr
// return opErr
servErr := closeCSAsync()
// check for servErr
}
agent/wsclient/client.go
Outdated
seelog.Warnf("Context canceled waiting for termination of websocket connection: %v for %s", | ||
ctx.Err(), cs.URL) | ||
} | ||
return err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit:
Its a bit annoying to figure out what err is all the way down here. Maybe make it clear which err is getting returned in a comment? (Its the original deadline err, right?)
58b68d2
to
4e527e0
Compare
Added changelog entry for stale websocket connection bug fix
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for mock updates!
@@ -60,6 +63,8 @@ const ( | |||
|
|||
// Default NO_PROXY env var IP addresses | |||
defaultNoProxyIP = "169.254.169.254,169.254.170.2" | |||
|
|||
errClosed = "use of closed network connection" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we have the error name more descriptive like errClosedConnection
or similar?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
errClosed
is pretty descriptive, no? I can change it you have a strong preference
seelog.Warnf("Unable to set read deadline for websocket connection: %v for %s", err, cs.URL) | ||
// If we get connection closed error from SetReadDeadline, break out of the for loop and | ||
// return an error | ||
if opErr, ok := err.(*net.OpError); ok && strings.Contains(opErr.Err.Error(), errClosed) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we do a cs.Close()
somewhere in this case too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so since it's already closed.
Added missing copyright headers for a couple of files
4e527e0
to
8d11bdb
Compare
|
||
gomock "github.com/golang/mock/gomock" | ||
) | ||
|
||
// Mock of FileSystem interface | ||
// MockFileSystem is a mock of FileSystem interface |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: // MockFileSystem is a mock of the FileSystem interface
api "github.com/aws/amazon-ecs-agent/agent/api" | ||
ecs "github.com/aws/amazon-ecs-agent/agent/ecs_client/model/ecs" | ||
gomock "github.com/golang/mock/gomock" | ||
) | ||
|
||
// Mock of ECSSDK interface | ||
// MockECSSDK is a mock of ECSSDK interface |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: *the ECSSDK interface
} | ||
// An unhandled error has occurred while trying to extend read deadline. | ||
// Try asynchronously closing the connection. We don't want to be blocked on stale connections | ||
// taking too long to close. The flip side is that we might start accumulating stale connections. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do we know that? Do we have a test that ensures that the expected behavior happens when we hit the limit?
Summary
Added error handling for
SetReadDeadline()
. This should take care or scenarios where stale websocket connections linger (#1292)Implementation details
Forcefully terminate connection if there's any error with the
SetReadDeadline() call. Not doing this leads to unpredictable behavior
with stale websocket connections.
Since SetWriteDeadline always returns
nil
, errors from that are nothandled.
Move WebsocketConn to its own package. This allows us to mock it
out in the "wsclient" package and write unit tests on it.
Testing
make release
)go build -out amazon-ecs-agent.exe ./agent
)make test
) passgo test -timeout=25s ./agent/...
) passmake run-integ-tests
) pass.\scripts\run-integ-tests.ps1
) passmake run-functional-tests
) pass.\scripts\run-functional-tests.ps1
) passNew tests cover the changes: Yes
Description for the changelog
Licensing
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.