Skip to content

Commit

Permalink
Add timeout to connection.Connect()
Browse files Browse the repository at this point in the history
If connection.Connect() gets stuck it will block infinitely, never
reporting an error to the library's consumer.

We have received reports of this happening even when a _valid_ address
is passed in node-driver-registrar because of a race condition involving
pod restarts. Specifically, there is an unlucky sequence of events when
restarting both the CSI Driver and node-driver-registrar, wheere the
node-driver-registar will attempt to Connect() to the CSI Driver, but
will get stuck doing so on an old file descriptor from the previously
running CSI Driver (and thus, get stuck infinitely).

There is no mechanism to pass a connection timeout to Connect, so this
commit adds a reasonbly long default timeout so these cases will
eventually return an error rather than getting stuck infinitely.

Signed-off-by: Connor Catlett <[email protected]>
  • Loading branch information
ConnorJC3 committed Jun 14, 2023
1 parent 597d128 commit bbcd132
Showing 1 changed file with 3 additions and 3 deletions.
6 changes: 3 additions & 3 deletions connection/connection.go
Original file line number Diff line number Diff line change
Expand Up @@ -54,8 +54,8 @@ func SetMaxGRPCLogLength(characterCount int) {
// file or have format '<protocol>://', following gRPC name resolution mechanism at
// https://github.com/grpc/grpc/blob/master/doc/naming.md.
//
// The function tries to connect indefinitely every second until it connects. The function automatically disables TLS
// and adds interceptor for logging of all gRPC messages at level 5.
// The function tries to connect for 30 seconds, and returns an error if no connection has been established at that point.
// The function automatically disables TLS and adds interceptor for logging of all gRPC messages at level 5.
//
// For a connection to a Unix Domain socket, the behavior after
// loosing the connection is configurable. The default is to
Expand All @@ -70,7 +70,7 @@ func SetMaxGRPCLogLength(characterCount int) {
// For other connections, the default behavior from gRPC is used and
// loss of connection is not detected reliably.
func Connect(address string, metricsManager metrics.CSIMetricsManager, options ...Option) (*grpc.ClientConn, error) {
return connect(address, metricsManager, []grpc.DialOption{}, options)
return connect(address, metricsManager, []grpc.DialOption{grpc.WithTimeout(time.Second * 30)}, options)
}

// Option is the type of all optional parameters for Connect.
Expand Down

1 comment on commit bbcd132

@adarsh-dell
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello @ConnorJC3, could you please provide insight into the rationale behind implementing this timeout? Given that many CSI-sidecars rely on this utility, imagine a scenario with two controllers where only one is the leader. In such a case, the non-leading controller is unable to connect. In the previous setup with an infinite timeout, the controllers attempted to establish a connection indefinitely. As a result, the controller pod remained in a running state, avoiding crashloopbackoff. Once the pod assumed the leader role, the session was established seamlessly but because of this change now pod is in crashloopbackoff.

Please sign in to comment.