Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Websocket timeout #993

Merged
merged 4 commits into from
Sep 28, 2017
Merged

Websocket timeout #993

merged 4 commits into from
Sep 28, 2017

Conversation

aaithal
Copy link
Contributor

@aaithal aaithal commented Sep 27, 2017

Summary

Implements some websocket connection management improvements.

Implementation details

  1. Set read and write deadlines for websocket ReadMessage and
    WriteMessage operations. This is to ensure that these methods
    do not hang and result in io timeout if there's issues with
    the connection
  2. Reduce the scope of the lock in the Connect() method. The
    lock was being held for the length of Connect() method, which
    meant that it wouldn't be relnquished if there was any delay
    in establishing the connection. The scope of the lock has now
    been reduced to just accessing the cs.conn variable
  3. Start ACS heartbeat timer after the connection has been
    established. The timer was being started before a call to
    Connect, which meant that the connection could be prematurely
    terminated for being idle if there was a delay in establishing
    the connection

Testing

  • Builds on Linux (make release)
  • Builds on Windows (go build -out amazon-ecs-agent.exe ./agent)
  • Unit tests on Linux (make test) pass
  • Unit tests on Windows (go test -timeout=25s ./agent/...) pass
  • Integration tests on Linux (make run-integ-tests) pass
  • Integration tests on Windows (.\scripts\run-integ-tests.ps1) pass
  • Functional tests on Linux (make run-functional-tests) pass
  • Functional tests on Windows (.\scripts\run-functional-tests.ps1) pass
  • Manual tests
    • Reconnects to ACS and TCS work as expected when ACS/TCS initiate disconnect
    • Reconnects to ACS and TCS work as expected when Agent initiates disconnect

New tests cover the changes: Yes

Description for the changelog

Covered in changelog.md

Licensing

This contribution is under the terms of the Apache 2.0 License: Yes

The wc command prefixes spaces to the output, which
corrupts the GIT_PORCELAIN variable, thus failing
the build. This change removes spaces from the output.
This commit aims to make the websocker connection management
better by implementing the following improvements:

1. Set read and write deadlines for websocket ReadMessage and
WriteMessage operations. This is to ensure that these methods
do not hang and result in io timeout if there's issues with
the connection
2. Reduce the scope of the lock in the Connect() method. The
lock was being held for the length of Connect() method, which
meant that it wouldn't be relnquished if there was any delay
in establishing the connection. The scope of the lock has now
been reduced to just accessing the cs.conn variable
3. Start ACS heartbeat timer after the connection has been
established. The timer was being started before a call to
Connect, which meant that the connection could be prematurely
terminated for being idle if there was a delay in establishing
the connection

These changes should improve the disconnection behavior of the
websocket connection, which should help with scenarios where the
Agent never reconnects to ACS because it's forever waiting in
Disconnect() method waiting to acquire the lock (aws#985)
seelog.Debug("ACS activity occured")
seelog.Debug("ACS activity occurred")
// Reset read deadline as there's activity on the channel
if err := client.SetReadDeadline(time.Now().Add(heartbeatTimeout + heartbeatJitter)); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd feel more comfortable if this were actually longer, like 2*heartbeatTimeout + heartbeatJitter.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain again here too? Doesn't doubling that have the net effect of doubling your pre-established heartbeatTimeout the first time there is some traffic over the websocket?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because we use rand.Int63n() in AddJitter, heartbeatTimeout + heartbeatJitter should always be greater than the heartbeat duration that we compute. I think it should be safe to set it the way it's currently being set.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't doubling that have the net effect of doubling your pre-established heartbeatTimeout the first time there is some traffic over the websocket?

No, the deadline is set based on time.Now(), so it's always in the future by the amount specified here.

I think it should be safe to set it the way it's currently being set.

I'm worried, because this is a time and not a duration, and because Go 1.7 doesn't have a monotonic clock, that something unexpected can happen here. I'd feel more comfortable with a longer timeout because we don't expect this timeout to be hit anyway and I don't see any danger of making it longer.

Once we move to Go 1.9 there will be a monotonic clock and this should be safer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd feel more comfortable with a longer timeout because we don't expect this timeout to be hit anyway

Oh, you'd be surprised :) I think we'll hit this timeout when there's connection issues. However, I do buy the argument that 2*heartbeatTimeout + heartbeatJitter will work for most cases here.

seelog.Warn("Unable to extend read deadline for ACS connection: %v", err)
}

// Reset heearbeat timer
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: heartbeat

@@ -35,8 +36,8 @@ const (
defaultPublishMetricsInterval = 20 * time.Second

// The maximum time to wait between heartbeats without disconnecting
defaultHeartbeatTimeout = 5 * time.Minute
defaultHeartbeatJitter = 3 * time.Minute
defaultHeartbeatTimeout = 1 * time.Minute
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did we change these?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume this is just meant to mimic the work that @richardpen did for the ACS client.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@samuelkarp @jhaynes is correct. This is reflecting the change that @richardpen did.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it.

return func(interface{}) {
seelog.Trace("TCS activity occured")
// Reset read deadline as there's activity on the channel
if err := client.SetReadDeadline(time.Now().Add(defaultHeartbeatTimeout + defaultHeartbeatJitter)); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above, I think this should be longer than just timeout + jitter.

@@ -377,7 +376,7 @@ func (acsSession *session) heartbeatJitter() time.Duration {

// createACSClient creates the ACS Client using the specified URL
func (acsResources *acsSessionResources) createACSClient(url string, cfg *config.Config) wsclient.ClientServer {
return acsclient.New(url, cfg, acsResources.credentialsProvider)
return acsclient.New(url, cfg, acsResources.credentialsProvider, heartbeatTimeout+heartbeatJitter)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be longer than timeout + jitter.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you expand on why? I would assume you'd want it to be smaller if anything. That way, you could retry a read/write without necessarily having to re-establish the whole connection.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will respond on the other comment.

deregisterInstanceEventStream *eventstream.EventStream) error {
client := tcsclient.New(url, cfg, credentialProvider, statsEngine, publishMetricsInterval)
client := tcsclient.New(url, cfg, credentialProvider, statsEngine,
publishMetricsInterval, defaultHeartbeatTimeout+defaultHeartbeatJitter)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be longer than timeout + jitter.

seelog.Debug("ACS activity occurred")
// Reset read deadline as there's activity on the channel
if err := client.SetReadDeadline(time.Now().Add(heartbeatTimeout + heartbeatJitter)); err != nil {
seelog.Warn("Unable to extend read deadline for ACS connection: %v", err)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Warnf

@@ -112,7 +118,7 @@ func startSession(url string, cfg *config.Config, credentialProvider *credential
timer := time.AfterFunc(utils.AddJitter(heartbeatTimeout, heartbeatJitter), func() {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you do the same as acs to move the timer after the connection was established?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I thought I did that. But, obviously didn't. I'll modify this.

// connection is active
func anyMessageHandler(client wsclient.ClientServer) func(interface{}) {
return func(interface{}) {
seelog.Trace("TCS activity occured")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the Trace level here intentionally?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The similar ACS log line is Debug level.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TCS channel is more chatty, which means we'll see this printed every 20 seconds if this is changed to Debug. I could get rid of this altogether if you have a strong preference here. But, somehow this Trace soothes my mind.

@@ -48,6 +49,9 @@ const (
// wsConnectTimeout specifies the default connection timeout to the backend.
wsConnectTimeout = 30 * time.Second

// wsHandshakeTimeout specifies the default handshake timeout for the websocket client
wsHandshakeTimeout = wsConnectTimeout

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we have wsHandshakeTimeout, do we still need wsConnectTimeout? As if the hand shake timed out, the connect will fail, is that right? Or they should be the same value, eg: wsHandshakeTimeout = 3*wsConnectTimeout

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would assume the wsHandshakeTimeout would be <= wsConectTimeout

// SetReadDeadline sets the read deadline for the websocket connection
// A read timeout results in an io error if there are any outstanding reads
// that exceed the deadline
func (cs *ClientServerImpl) SetReadDeadline(t time.Time) error {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be better to pass the duration and calculate the time right before calling the SetReadDeadline, it's more accurate, as the passed in time could already beyond time.Now() for some wired reason.

cs := &clientServer{}
cs.URL = url
cs.CredentialProvider = credentialProvider
cs.AgentConfig = cfg
cs.ServiceError = &acsError{}
cs.RequestHandlers = make(map[string]wsclient.RequestHandler)
cs.TypeDecoder = NewACSDecoder()
cs.RWTimeout = rwTimeout
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:
why not just 'timeout'?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted it to convey more meaning that just timeout would convey.

closeErr := client.Close()
if closeErr != nil {
seelog.Warnf("Error disconnecting: %v", closeErr)
if err := client.Close(); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is only one return value.. so is this still more idiomatic than:
if client.Close() != nil {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how is that? I don't see how err := something; if err != nil {} is more idiomatic than if err := something; err != nil {}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right! I was asking if setting the err variable is necessary at all. On closer inspection I see that we log the error, so you should ignore this :)

return func(interface{}) {
seelog.Debug("ACS activity occured")
seelog.Debug("ACS activity occurred")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this debug line useful?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think so. Shows us when the timer was reset.

timer := newDisconnectionTimer(mockWsClient, acsSession.heartbeatTimeout(), acsSession.heartbeatJitter())
defer timer.Stop()
acsSession.startACSSession(mockWsClient, timer)
acsSession.startACSSession(mockWsClient)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: remove the function wrapper
go acsSession.startACSSession(mockWsClient)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will do.

@@ -35,8 +36,8 @@ const (
defaultPublishMetricsInterval = 20 * time.Second

// The maximum time to wait between heartbeats without disconnecting
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we go ahead and add the default rwtimeout here too? even if expresed as:

defaultRWTimeout = defaultHeartBeatJitter + defaultHeartbeatTimeout

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I thought about and didn't think we needed a new const for this. But, I think it makes it more explicit. I'll add that.

"github.com/aws/amazon-ecs-agent/agent/wsclient"
"github.com/aws/aws-sdk-go/aws/credentials"
"github.com/cihub/seelog"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@@ -377,7 +376,7 @@ func (acsSession *session) heartbeatJitter() time.Duration {

// createACSClient creates the ACS Client using the specified URL
func (acsResources *acsSessionResources) createACSClient(url string, cfg *config.Config) wsclient.ClientServer {
return acsclient.New(url, cfg, acsResources.credentialsProvider)
return acsclient.New(url, cfg, acsResources.credentialsProvider, heartbeatTimeout+heartbeatJitter)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you expand on why? I would assume you'd want it to be smaller if anything. That way, you could retry a read/write without necessarily having to re-establish the whole connection.

seelog.Debug("ACS activity occured")
seelog.Debug("ACS activity occurred")
// Reset read deadline as there's activity on the channel
if err := client.SetReadDeadline(time.Now().Add(heartbeatTimeout + heartbeatJitter)); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain again here too? Doesn't doubling that have the net effect of doubling your pre-established heartbeatTimeout the first time there is some traffic over the websocket?

@@ -46,7 +46,12 @@ type clientServer struct {
// New returns a client/server to bidirectionally communicate with the backend.
// The returned struct should have both 'Connect' and 'Serve' called upon it
// before being used.
func New(url string, cfg *config.Config, credentialProvider *credentials.Credentials, statsEngine stats.Engine, publishMetricsInterval time.Duration) wsclient.ClientServer {
func New(url string,
cfg *config.Config,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 I much prefer this style for functions with more than a few arguments.

@@ -67,11 +73,11 @@ func New(url string, cfg *config.Config, credentialProvider *credentials.Credent
func (cs *clientServer) Serve() error {
seelog.Debug("TCS client starting websocket poll loop")
if !cs.IsReady() {
return fmt.Errorf("Websocket not ready for connections")
return fmt.Errorf("tcs client: websocket not ready for connections")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should TCS be capitalized here? We capitalize ACS in most places.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're sticking to the style of error strings beginning with a lower case letter.

}

if cs.statsEngine == nil {
return fmt.Errorf("uninitialized stats engine")
return fmt.Errorf("tcs client: uninitialized stats engine")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above. Probably should be TCS.

@@ -35,8 +36,8 @@ const (
defaultPublishMetricsInterval = 20 * time.Second

// The maximum time to wait between heartbeats without disconnecting
defaultHeartbeatTimeout = 5 * time.Minute
defaultHeartbeatJitter = 3 * time.Minute
defaultHeartbeatTimeout = 1 * time.Minute
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume this is just meant to mimic the work that @richardpen did for the ACS client.

@@ -112,7 +118,7 @@ func startSession(url string, cfg *config.Config, credentialProvider *credential
timer := time.AfterFunc(utils.AddJitter(heartbeatTimeout, heartbeatJitter), func() {
// Close the connection if there haven't been any messages received from backend
// for a long time.
seelog.Debug("TCS Connection hasn't had a heartbeat or an ack message in too long of a timeout; disconnecting")
seelog.Info("TCS Connection hasn't had a heartbeat or an ack message in too long of a timeout; disconnecting")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This wording is confusing. The similar ACS log message is: "ACS Connection hasn't had any activity for too long; closing connection"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll modify that.

// connection is active
func anyMessageHandler(client wsclient.ClientServer) func(interface{}) {
return func(interface{}) {
seelog.Trace("TCS activity occured")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The similar ACS log line is Debug level.

@@ -48,6 +49,9 @@ const (
// wsConnectTimeout specifies the default connection timeout to the backend.
wsConnectTimeout = 30 * time.Second

// wsHandshakeTimeout specifies the default handshake timeout for the websocket client
wsHandshakeTimeout = wsConnectTimeout
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would assume the wsHandshakeTimeout would be <= wsConectTimeout

@aaithal
Copy link
Contributor Author

aaithal commented Sep 28, 2017

@samuelkarp @jhaynes @richardpen @petderek I've modified the code to address your comments. Can you please take a look?

Copy link

@richardpen richardpen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One minor comment


// This is just future proofing. Ignore the error as the gorilla websocket
// library returns 'nil' anyway for SetWriteDeadline
// https://github.com/gorilla/websocket/blob/master/conn.go#L761

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good if you could change this link to reference a specific commit, so we won't lose the information if that file is changed. like: https://github.com/gorilla/websocket/blob/4201258b820c74ac8e6922fc9e6b52f71fe46f8d/conn.go#L761

Increase the websocket read and write timeouts
as per review comments
@aaithal aaithal merged commit b3db90b into aws:dev Sep 28, 2017
@jhaynes jhaynes mentioned this pull request Sep 30, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants