Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: ensure that a pgwire-level cancellation request (using a cancel msg) prevents a client freeze #32973

Closed
chessman opened this issue Dec 10, 2018 · 19 comments · Fixed by #33202
Assignees
Labels
A-sql-pgwire pgwire protocol issues. C-investigation Further steps needed to qualify. C-label will change. O-community Originated from the community

Comments

@chessman
Copy link

chessman commented Dec 10, 2018

Issue summary (Added @knz 2018-12-14)

The client tries to cancel its request with a pg cancel message and then gets stuck (until a timeout of 2mn) because the server does not respond. Instead CockroachDB should see the pg cancel message and immediately return an error packet and/or close the connection.

Describe the problem

  1. Create an insert request.
  2. Cancel context when the request is executing.

The request will be terminated exactly in 2 minutes.

To Reproduce

	dbname := "db" + strconv.Itoa(int(uuid.New().ID()))
	connString := fmt.Sprintf("postgresql://root@localhost:26257/%s?sslmode=disable", dbname)
	db, err := sql.Open("postgres", connString)
	if err != nil {
		panic(err)
	}
	_, err = db.Exec(`CREATE DATABASE ` + dbname)
	if err != nil {
		panic("cannot create database: " + err.Error())
	}

	defer db.Exec(`DROP DATABASE ` + dbname)

	_, err = db.Exec(`CREATE TABLE IF NOT EXISTS table1 (id STRING PRIMARY KEY)`)
	if err != nil {
		panic("cannot create schema: " + err.Error())
	}

	for {
		ctx := context.Background()
		ctx, cancel := context.WithCancel(ctx)
		go func() {
			time.Sleep(15 * time.Millisecond)
			cancel()
		}()
		started := time.Now()
		query := `INSERT INTO table1 (id) VALUES ($1)`
		id := uuid.New().String()

		_, err := db.ExecContext(ctx, query, id)
		if err != nil && err != context.Canceled {
			panic(err)
		}

		fmt.Println("inserted:", time.Since(started))
		if time.Since(started) > 1*time.Second {
			panic("too long")
		}
	}

Expected behavior
The request should be terminated immediately after the context is canceled.

Environment:

  • CCL v2.1.1 @ 2018/11/19 18:24:21 (go1.10.3)
  • Server OS: Debian
  • Golang (database/sql, lib/pq)
@tbg tbg changed the title Request freezes for 2 minutes if context is canceled during insert sql: r freezes for 2 minutes if context is canceled during insert Dec 12, 2018
@tbg tbg changed the title sql: r freezes for 2 minutes if context is canceled during insert sql: request freezes for 2 minutes if context is canceled during insert Dec 12, 2018
@tbg tbg assigned knz and unassigned jordanlewis Dec 12, 2018
@tbg
Copy link
Member

tbg commented Dec 12, 2018

@knz could you triage?

@knz knz added C-investigation Further steps needed to qualify. C-label will change. O-community Originated from the community A-sql-pgwire pgwire protocol issues. labels Dec 13, 2018
@knz
Copy link
Contributor

knz commented Dec 13, 2018

@mjibson what do you think? Is this just driver-side or is there anything to do on our end?

@knz knz removed their assignment Dec 13, 2018
@chessman
Copy link
Author

I just checked it with Postgres and it didn't reproduce.

@maddyblue
Copy link
Contributor

Can you try with https://github.com/jackc/pgx? I'm doubtful this is a cockroach bug. We already don't support the cancel protocol (which isn't a bug, we don't currently intend to support it) and I'm curious if something is waiting for a timeout there.

@chessman
Copy link
Author

@mjibson I can't reproduce it with github.com/jackc/pgx. It seems, it more properly handles context cancelation. But still strange that this bug doesn't reproduce with Postgres and lib/pq.

@maddyblue
Copy link
Contributor

Did you use pgx directly or via database/sql? database/sql and lib/pq may be interacting in broken ways.

@chessman
Copy link
Author

All tests are via database/sql.

That's stacktraces from lib/pq:

 	0x70e1c1	github.com/lib/pq.(*conn).watchCancel.func2+0xc1	/home/ea/golang/src/github.com/lib/pq/conn_go18.go:91
#	0x703f80	github.com/lib/pq.(*conn).ExecContext+0x210		/home/ea/golang/src/github.com/lib/pq/conn_go18.go:43
#	0x4d5fd4	database/sql.ctxDriverExec+0x234			/usr/lib/go-1.10/src/database/sql/ctxutil.go:31
#	0x4e455f	database/sql.(*DB).execDC.func2+0x1df			/usr/lib/go-1.10/src/database/sql/sql.go:1391
#	0x4e3d92	database/sql.withLock+0x62				/usr/lib/go-1.10/src/database/sql/sql.go:3032
#	0x4dc477	database/sql.(*DB).execDC+0x477				/usr/lib/go-1.10/src/database/sql/sql.go:1386
#	0x4dbfb0	database/sql.(*DB).exec+0x130				/usr/lib/go-1.10/src/database/sql/sql.go:1371
#	0x4dbbc0	database/sql.(*DB).ExecContext+0xe0			/usr/lib/go-1.10/src/database/sql/sql.go:1349


#	0x428c36	internal/poll.runtime_pollWait+0x56			/usr/lib/go-1.10/src/runtime/netpoll.go:173
#	0x48e45a	internal/poll.(*pollDesc).wait+0x9a			/usr/lib/go-1.10/src/internal/poll/fd_poll_runtime.go:85
#	0x48e4dc	internal/poll.(*pollDesc).waitRead+0x3c			/usr/lib/go-1.10/src/internal/poll/fd_poll_runtime.go:90
#	0x48f33c	internal/poll.(*FD).Read+0x17c				/usr/lib/go-1.10/src/internal/poll/fd_unix.go:157
#	0x57e86e	net.(*netFD).Read+0x4e					/usr/lib/go-1.10/src/net/fd_unix.go:202
#	0x58f979	net.(*conn).Read+0x69					/usr/lib/go-1.10/src/net/net.go:176
#	0x56b0b1	io/ioutil.devNull.ReadFrom+0x91				/usr/lib/go-1.10/src/io/ioutil/ioutil.go:147
#	0x472079	io.copyBuffer+0x319					/usr/lib/go-1.10/src/io/io.go:386
#	0x471c29	io.Copy+0x59						/usr/lib/go-1.10/src/io/io.go:362
#	0x7046ec	github.com/lib/pq.(*conn).cancel+0x34c			/home/ea/golang/src/github.com/lib/pq/conn_go18.go:125
#	0x70e0cd	github.com/lib/pq.(*conn).watchCancel.func1+0xed	/home/ea/golang/src/github.com/lib/pq/conn_go18.go:85

lib/pq sends the Postgres cancel request code:
https://github.com/lib/pq/blob/master/conn_go18.go#L83

Probably, Cockroach doesn't handle it properly.

@knz
Copy link
Contributor

knz commented Dec 14, 2018

@chessman thanks for the additional information. As noted above, CockroachDb does not support the cancel protocol, however it does cancel a query if the connection gets closed. This will also achieve the desired effect.

@chessman
Copy link
Author

lib/pq sends the cancelation code and just waits for EOF. Maybe Cockroach can return EOF for this code? It might be even "not implemented" error, any result will unblock the driver.

@knz
Copy link
Contributor

knz commented Dec 14, 2018

Yes that seems reasonable.

@knz knz changed the title sql: request freezes for 2 minutes if context is canceled during insert sql: ensure that a pgwire-level cancellation request (using a cancel msg) prevents a client freeze Dec 14, 2018
@knz
Copy link
Contributor

knz commented Dec 14, 2018

@mjibson even though we may not support the pg cancel protocol, do you see any way we can detect when it is received while a query is running? And respond with an error immediately? I'm not sure.

Maybe @andreimatei you have an opinion about this too?

@jordanlewis
Copy link
Member

We should be able to do this with a similar strategy that we were just playing with in #33174. We can change the implementation of AddRow to cause the flow to return early if a cancellation message has been received.

@andreimatei
Copy link
Contributor

andreimatei commented Dec 17, 2018

The cancel signal is sent by drivers on new connections, not the connection processing the query being cancelled. These dedicated connections are identified from the get-go; the docs say:

To issue a cancel request, the frontend opens a new connection to the server and sends a CancelRequest message, rather than the StartupMessage message that would ordinarily be sent across a new connection. The server will process this request and then close the connection.

So it seems that we should just close connections when a CancelRequest is received. Probably easy to do.

@maddyblue
Copy link
Contributor

The cancel docs also state that

For security reasons, no direct reply is made to the cancel request message.

Implying that we shouldn't even return a "not implemented" error.

There are unsolved problems when dealing with the cancel protocol:

  1. Cockroach is distributed so we have to figure out what, if anything, to do if a server receives a request for an ID it doesn't recognize. Should it tell the other nodes?
  2. The IDs are two 32bit numbers. Cockroach queries use a single 128bit identifier, so we can't just use that ID verbatim. Do we only use the top or bottom half? Generate (and thus store) a 64bit ID? Use some 64bit hash function on the 128bit ID? Something else?

Answering these questions isn't currently on the roadmap, so I'm not sure anything will move here soon.

@knz
Copy link
Contributor

knz commented Dec 17, 2018

@mjibson we can close the connection upon receiving a cancel message and not do anything else.

@knz
Copy link
Contributor

knz commented Dec 17, 2018

(except bumping a telemetry counter, that is)

@maddyblue
Copy link
Contributor

Yes, we should be doing that. If we aren't that's pretty surprising. Confirmed that psql also hangs when you run pg_sleep and hit ^C, which should do that.

@knz
Copy link
Contributor

knz commented Dec 17, 2018

@mjibson could you casually investigate what the current situation is in the code? ie what cockroachdb does/says wyhen it receives a cancel msg? Depending on the specific current situation maybe we can do something about it.

@maddyblue
Copy link
Contributor

I agree. I will do that.

@knz knz added the fixitday label Dec 17, 2018
craig bot pushed a commit that referenced this issue Dec 18, 2018
33202: pgwire: detect, log, and immediately close any cancel requests r=mjibson a=mjibson

Previously the cmuxer would not detect pgwire cancel requests as pgwire
and instead hand them off to GRPC causing them to hang. We still don't
support cancel, but it should be handled better by clients now.

Fixes #32973

Release note (bug fix): Cancel requests (via the pgwire protocol) will
now close quickly with an EOF instead of hang (but still don't cancel
the request).

Co-authored-by: Matt Jibson <[email protected]>
@craig craig bot closed this as completed in #33202 Dec 18, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-sql-pgwire pgwire protocol issues. C-investigation Further steps needed to qualify. C-label will change. O-community Originated from the community
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants