concurrency.NewSession hang after etcd server is killed with SIGSTOP(19) #14631

haojinming · 2022-10-26T13:54:57Z

What happened?

concurrency.NewSession hangs after the etcd server is stopped with SIGSTOP(19).

What did you expect to happen?

NewSession should return an error after the server is stopped.

How can we reproduce it (as minimally and precisely as possible)?

Start three or more etcd server nodes.
Run main with the following code.
Run kill -19 on the PID of the etcd leader.

package main

import (
	"fmt"
	"time"

	"github.com/pingcap/log"
	clientv3 "go.etcd.io/etcd/client/v3"
	"go.etcd.io/etcd/client/v3/concurrency"
	"go.uber.org/zap"
)

func initEtcdClient() *clientv3.Client {
	var client *clientv3.Client
	var err error
	endpoints := []string{"172.16.5.32:2379", "172.16.5.32:2382", "172.16.5.32:2384"}
	client, err = clientv3.New(clientv3.Config{
		Endpoints:            endpoints,
		DialTimeout:          5 * time.Second,
		DialKeepAliveTimeout: 5 * time.Second,
	})
	if err != nil {
		fmt.Printf("create client fail:%v\n", err)
		log.Panic("create client fail", zap.Error(err))
	}
	return client
}

func main() {

	number := 0
	client := initEtcdClient()
	for {
		log.Info("create session begin.", zap.Int("time", number))
		s, err := concurrency.NewSession(client)
		if err != nil {
			log.Panic("create session fail", zap.Error(err))
		}
		log.Info("create session finish.", zap.Int("time", number))
		s.Close()
		number++
		time.Sleep(time.Second)
	}
}

Anything else we need to know?

If the etcd client is re-created after kill -19, it returns an error. However, in our application, the client is created at startup and stored for use during the whole lifecycle of the application.

Etcd version (please run commands below)

go.etcd.io/etcd/client/v3 v3.5.5

Relevant log output

No response


haojinming · 2022-10-27T02:59:18Z

If the etcd leader is killed with SIGKILL(9), the following error is logged and NewSession continues to work after a new leader is elected.

{"level":"warn","ts":"2022-10-27T10:55:32.940+0800","logger":"etcd-client","caller":"v3@v3.5.5/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000257340/172.16.5.32:2379","attempt":0,"error":"rpc error: code = Unavailable desc = etcdserver: request timed out"}

I think it should behave the same way when the leader is stopped with SIGSTOP, shouldn't it?

zeminzhou · 2022-11-08T06:17:53Z

@ahrtr How about adding a configuration item that controls the timeout of gRPC function calls when creating the etcd client?

MatteoGioioso · 2022-11-12T08:20:03Z

I might be completely wrong here, but I think the cause of this behavior is this option: defaultWaitForReady = grpc.WaitForReady(true) in options.go of the go.etcd.io/etcd/client/v3 module.
Since it is not accessible, I tried setting it to false by temporarily modifying the file, and in my case it no longer hangs.

zeminzhou · 2022-11-14T03:12:41Z

Thanks, I changed grpc.WaitForReady to false, it works. But I don't understand why grpc.WaitForReady is hardcoded to true.

MatteoGioioso · 2022-11-14T04:40:12Z

I have no idea. I found an old issue that proposed exposing some of those options, but it went stale.

Here is the issue: #13344

halegreen · 2023-01-07T06:53:05Z

After SIGSTOP(19) on the etcd leader, a new leader won't be elected?

haojinming · 2023-01-09T01:38:37Z

After SIGSTOP(19) on the etcd leader, a new leader won't be elected?

Yes, the main program hangs without any log output or further action.

huangjiao-heart · 2023-02-22T01:59:39Z

Thanks, I changed grpc.WaitForReady to false, it works. But I don't understand why grpc.WaitForReady is hardcoded to true.
@zeminzhou
I also encountered the same problem. How can we change this value in our code?

MatteoGioioso · 2023-02-22T12:23:04Z

I also encountered the same problem. How can we change this value in our code?

@huangjiao-heart unfortunately you can't, as it is a private variable. You have to change it in the package itself.

To be clearer, you have to literally open the file where the variable is hardcoded and change it manually, or make a fork and change it there.

AngstyDuck · 2023-07-19T13:48:17Z

I'm still quite new to this, but adding a timeout to the LeaseGrant request seems to raise the appropriate error when the servers are killed. Would there be any negative repercussions if we did so?

fuweid · 2023-07-20T03:06:09Z

There are two cases:

If the etcd server has been paused by SIGSTOP, the client waits for the connection to become ready because of WaitForReady. This can be solved by exporting the option.
If the etcd server is paused after the connection is ready, the client waits for the HTTP/2 response forever. This should be handled by passing a ctx option when you create the new session.

I'm not sure what the reason for pausing the process is. Pausing freezes it, which affects all ready connections.
If you want to stop the server, you should use SIGTERM or SIGKILL.

AngstyDuck · 2023-07-23T15:18:30Z

If you want to stop the server, you should use SIGTERM or SIGKILL.

This bug can be replicated if the server is stopped by SIGTERM or SIGKILL as well.

The freeze occurs during the creation of a new session, where the client attempts to send a LeaseGrant request to the server. Because the gRPC option WaitForReady is set to true, the creation of the RPC is queued until the connection is ready, without returning any errors.

I understand from the provided documentation that this option is set to minimise error responses from transient failures. In that case, I'd like to propose setting a timeout to the LeaseGrant gRPC request specifically during the creation of new sessions. I'm thinking of exposing the timeout duration as a configurable value (I'm thinking of naming it as SessionCreationTimeout for now) in the Config object in etcd/client/v3/config.go. Do let me know if this has any negative side effects, otherwise I'll be preparing a PR for this over the next few days 👍

fuweid · 2023-07-24T02:30:38Z

@AngstyDuck Yes. gRPC has a background picker goroutine that collects available connections for new transports. I think it's a good option to offer users who want to disable WaitForReady.

haojinming added the type/bug label Oct 26, 2022

haojinming changed the title ~~concurrency.NewSession hang after etcd server is kill by SIGSTOP(19)~~ concurrency.NewSession hang after etcd server is killed with SIGSTOP(19) Oct 27, 2022

ahrtr mentioned this issue Oct 28, 2022

campagin maybe stuck, when send SIGSTOP to etcd server leader #14641

Closed

serathius added the release/v3.5 label Nov 15, 2022

serathius mentioned this issue Jan 18, 2023

Plan for v3.5.7 release #15141

Closed

serathius added the help wanted label May 10, 2023

serathius mentioned this issue May 10, 2023

Plans for v3.5.9 release #15871

Closed

4 tasks

CaojiamingAlan mentioned this issue Jul 11, 2023

Improving client call options flexibility and replace retry interceptor #16216

Open

AngstyDuck linked a pull request Aug 7, 2023 that will close this issue

client: Added new session option to set timeout for session creation #16385

Open
