Description
[Locket client] gRPC Client improvement
Summary
Recently we did research on how diego performs when network delay is introduced on a landscape. The tool we used was turbulence-bosh, which basically simulated n-second delays for every request on a certain vm from the cf-deployment. A detailed summary of the results can be found here. I suggest quickly going over it, because different turbulence scenarios will be referenced multiple times.
One of the weak points we found during the research was in the locket client. Since it is used by the bbs, the auctioneer and the cells for maintaining presence and locks, it is a really important part of diego. The first case is when we introduce a network delay on one of the diego-api vms (the bbs can be active or passive, the result is the same). All the cells that are using the locket server on that diego-api lose their presence. When a rep process is started on a cell, the locket client establishes a gRPC connection to a server and that connection is never closed. So when we introduce turbulence, the requests to the server take too long to be processed and the TTL runs out. The expected behaviour in this case is that the client notices that the server is taking too long and closes the connection, so it can re-connect to another server. But that does not happen. For more details, see Scenario 1 in the doc.
The other problem we found is with the bbs and the auctioneer. Since they also use a locket client to acquire a lock and maintain it, we see the same behaviour here. If the active auctioneer is connected to a locket server that we target with turbulence, it loses the lock and a failover occurs, even though the auctioneer itself is fine and the attack is on a completely different vm. In this case we again expect the client to become aware that the server is responding slowly and close the connection. For more details, see Scenarios 2 and 3 in the doc.
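To make the failure mode concrete, here is a simplified, hypothetical sketch of a TTL-based heartbeat loop. It is not the actual locket lock runner; maintainPresence and refreshLock are illustrative stand-ins for the periodic gRPC Lock / presence call:

package locketclient

import (
	"context"
	"time"
)

// maintainPresence is a simplified, hypothetical sketch of a TTL-based
// heartbeat loop; it is NOT the actual locket lock runner. refreshLock stands
// in for the periodic gRPC Lock / presence call made against the locket server.
func maintainPresence(ctx context.Context, ttl time.Duration, refreshLock func(context.Context) error) error {
	ticker := time.NewTicker(ttl / 2) // refresh well before the TTL expires
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			// When the targeted locket server is slow, this call can outlast the
			// TTL: the presence or lock expires server-side even though the client
			// never closed the connection. With a keepalive timeout the client
			// would instead drop the connection and reconnect to a healthy server.
			if err := refreshLock(ctx); err != nil {
				return err
			}
		}
	}
}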
Diego repo
Locket repo: https://github.com/cloudfoundry/locket
Describe alternatives you've considered (optional)
We currently have the gRPC client set up like this:
conn, err := grpc.Dial(
	config.LocketAddress,
	grpc.WithTransportCredentials(credentials.NewTLS(locketTLSConfig)),
	grpc.WithDialer(func(addr string, _ time.Duration) (net.Conn, error) {
		// give at least 2 seconds per ip address (assuming there are at most 5)
		return net.DialTimeout("tcp", addr, 10*time.Second)
	}),
	grpc.WithBlock(),
	grpc.WithTimeout(10*time.Second), // ensure that grpc won't keep retrying forever
	grpc.WithKeepaliveParams(keepalive.ClientParameters{
		Time: 10 * time.Second, // send a keepalive ping after 10s of inactivity
	}),
)
We have a keepalive ping that is sent to the server every 10s, but we do not have a timeout for that ping. If we set a timeout of 3s, for example, the client will send a ping to the server, and if the server is not able to respond within those 3s, the connection will be closed and the client will connect to another server. For more info: https://github.com/grpc/grpc/blob/master/doc/keepalive.md
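A minimal sketch of the proposed change, keeping the dial options above and only adding the keepalive Timeout (the 3s value is just an example):

	grpc.WithKeepaliveParams(keepalive.ClientParameters{
		Time:    10 * time.Second, // ping the server after 10s of inactivity
		Timeout: 3 * time.Second,  // close the connection if the ping is not acknowledged within 3s
	}),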
We have tested this on a dev landscape and the results are great. The cells are able to connect to a healthy locket server and do not lose their presence. The same applies to the bbs and the auctioneer: no lost locks and no unneeded failovers. So in our opinion this will greatly improve how diego performs when we have network delays.
Additional Text Output, Screenshots, or contextual information (optional)
I think it is a good idea to make those properties of the gRPC client (Time and Timeout) part of the locket config, so the client behaviour can be tuned for the use case.
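For illustration only, a rough sketch of what that could look like; the config fields and helper below are hypothetical names and not part of the current locket client configuration:

package locketclient

// Hypothetical sketch: the config fields and helper below are illustrative
// only and not part of the current locket client configuration.

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

type KeepaliveConfig struct {
	KeepaliveTime    time.Duration // inactivity interval after which the client pings the server
	KeepaliveTimeout time.Duration // how long to wait for a ping ack before closing the connection
}

func keepaliveDialOption(cfg KeepaliveConfig) grpc.DialOption {
	return grpc.WithKeepaliveParams(keepalive.ClientParameters{
		Time:    cfg.KeepaliveTime,
		Timeout: cfg.KeepaliveTimeout,
	})
}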