
Connection pool deadlocks with v5 (regression from v4) #1354

Closed · benweint opened this issue Oct 26, 2022 · 11 comments

@benweint (Contributor)

Describe the bug
We're seeing an issue wherein we can consistently reproduce application hangs with pgx/v5 that do not occur with pgx/v4. The only changes to the application itself between the pgx/v4-based version that works and the pgx/v5 version that doesn't are those necessitated by the API differences between the two versions (namely, pgxpool.ConnectConfig vs. pgxpool.NewWithConfig).
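
For reference, the construction-level difference between the two builds is roughly the following (a sketch of the migration change, not the application's actual code):

package main

import (
	"context"
	"os"

	"github.com/jackc/pgx/v5/pgxpool"
)

// newPool constructs the pool the v5 way; under v4 the same parsed config
// was passed to pgxpool.ConnectConfig(ctx, cfg) instead.
func newPool(ctx context.Context, dsn string) (*pgxpool.Pool, error) {
	cfg, err := pgxpool.ParseConfig(dsn)
	if err != nil {
		return nil, err
	}
	return pgxpool.NewWithConfig(ctx, cfg) // v4: pgxpool.ConnectConfig(ctx, cfg)
}

func main() {
	pool, err := newPool(context.Background(), os.Getenv("DATABASE_URL"))
	if err != nil {
		panic(err)
	}
	defer pool.Close()
}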

To Reproduce

I don't have a compact and easily shareable reproduction case yet, but here's the outline of what we're doing:

  1. Deploy an application that uses pgx/v5 to read from a Postgres 14 database. In our case, the DB happens to be running on GCP's CloudSQL service, and the DB in question is configured using CloudSQL's high-availability feature.
  2. Confirm that the application is working as expected (able to read from the DB). In our case, all of the application's reads happen via the QueryRow method.
  3. Initiate a CloudSQL database failover as described here.
  4. Attempt to exercise the application again, and observe that DB calls all hang under the following stack trace:
goroutine 43048 [semacquire, 2 minutes]:
sync.runtime_SemacquireMutex(0x101f64a68?, 0x20?, 0x14000e38c08?)
	/Users/ben/.asdf/installs/golang/1.19/go/src/runtime/sema.go:77 +0x24
sync.(*Mutex).lockSlow(0x14000464618)
	/Users/ben/.asdf/installs/golang/1.19/go/src/sync/mutex.go:171 +0x178
sync.(*Mutex).Lock(0x14000e38d08?)
	/Users/ben/.asdf/installs/golang/1.19/go/src/sync/mutex.go:90 +0x7c
github.com/jackc/puddle/v2.(*Pool[...]).Acquire(0x140001e08c0, {0x10135cf28, 0x14002b31320})
	/Users/ben/go/pkg/mod/github.com/jackc/puddle/v2@v2.0.0/pool.go:413 +0x2a4
github.com/jackc/pgx/v5/pgxpool.(*Pool).Acquire(0x1400018c3f0, {0x10135cf28, 0x14002b31320})
	/Users/ben/go/pkg/mod/github.com/jackc/pgx/v5@v5.0.2/pgxpool/pool.go:496 +0x50
github.com/jackc/pgx/v5/pgxpool.(*Pool).QueryRow(0x1400115c48d?, {0x10135cf28, 0x14002b31320}, {0x100fc7b14, 0x4d}, {0x140019994e0, 0x1, 0x1})
	/Users/ben/go/pkg/mod/github.com/jackc/pgx/v5@v5.0.2/pgxpool/pool.go:623 +0x38
< ... snip - app code below here ... >

From what I can tell, all of the request handling goroutines are blocked under this same stack, and there don't seem to be any that are actually holding DB connections.

I've also been able to reproduce this running entirely locally by interposing Toxiproxy in between the application and the DB, and adding a timeout toxic to the proxied connection. Even after this toxic is removed, the application will not recover once it's gotten into this state.

Interestingly, the application does recover if the duration of the toxic being in place is short enough (a few seconds). Even with a high rate of incoming requests (500 reqs/s with a concurrency of 500), it seems to take anywhere from 10 - 60 seconds for it to get stuck like this, suggesting a timing-dependent issue of some kind.

I'm hoping to be able to winnow this down to an easily-shareable test case soon.

Expected behavior

I would expect the application to be impacted while the DB is unreachable or unresponsive, but the pool should become usable again once the DB recovers.

Actual behavior

If the DB unreachability goes on for long enough, the pool will eventually enter a state wherein any goroutine that attempts to Acquire a connection hangs indefinitely.
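
One way to see the stuck state from outside the blocked request handlers is to poll the pool's counters. Below is a minimal sketch using pgxpool's Stat API; the watcher itself is illustrative and not part of the affected application:

package poolwatch

import (
	"context"
	"log"
	"time"

	"github.com/jackc/pgx/v5/pgxpool"
)

// WatchPool logs the pool's counters every few seconds. In the stuck state
// described above, callers remain blocked in Acquire even though no goroutine
// appears to hold a connection, so these numbers stop changing.
func WatchPool(ctx context.Context, pool *pgxpool.Pool) {
	t := time.NewTicker(5 * time.Second)
	defer t.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-t.C:
			s := pool.Stat()
			log.Printf("pool: total=%d idle=%d acquired=%d constructing=%d",
				s.TotalConns(), s.IdleConns(), s.AcquiredConns(), s.ConstructingConns())
		}
	}
}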

Version

  • Go: go version go1.19 darwin/arm64
  • PostgreSQL: PostgreSQL 14.5 (Debian 14.5-1.pgdg110+1) on aarch64-unknown-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit
  • pgx: github.com/jackc/pgx/v5 v5.0.2
@benweint benweint added the bug label Oct 26, 2022
@benweint (Contributor, Author)

Ok, here's a test case:

package main

import (
	"context"
	"fmt"
	"github.com/jackc/pgx/v5/pgxpool"
	"os"
	"sync"
	"time"
)

func main() {
	// urlExample := "postgres://username:password@localhost:5432/database_name"
	ctx := context.Background()

	cfg, err := pgxpool.ParseConfig(os.Getenv("DATABASE_URL"))
	if err != nil {
		panic(err)
	}

	cfg.MinConns = 100
	cfg.MaxConns = 100

	pool, err := pgxpool.NewWithConfig(context.Background(), cfg)
	if err != nil {
		panic(err)
	}
	defer pool.Close()

	concurrency := 100
	var wg sync.WaitGroup

	for i := 0; i < concurrency; i++ {
		wg.Add(1)
		go runQueries(ctx, fmt.Sprintf("%d", i), pool)
	}

	wg.Wait()
}

func runQueries(ctx context.Context, id string, pool *pgxpool.Pool) {
	for {
		runOnce(ctx, id, pool)
	}
}

func runOnce(ctx context.Context, id string, pool *pgxpool.Pool) {
	ctx, cancel := context.WithTimeout(ctx, 1*time.Second)
	defer cancel()

	var one int
	err := pool.QueryRow(ctx, "SELECT 1").Scan(&one)
	if err != nil {
		fmt.Printf("got error in %s: %v\n", id, err)
		return
	}
	fmt.Printf("ok %s!\n", id)
}

Run with:

DATABASE_URL=postgres://<user>:<pass>@localhost:<port>/<db> go run .

If connected directly to a DB, this works fine. To reproduce this issue, the easiest thing I've found so far is to use Toxiproxy:

Start the server with toxiproxy-server. Create a proxy listening on some port, pointing to your local Postgres:

toxiproxy-cli create --listen localhost:5435 --upstream localhost:5432 pg

Run the test case again, pointed at the proxy listening port (5435 in this example) instead of directly at the DB. Verify that it works (you should see lots of ok! messages).
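
Concretely, that's the same command with the proxy's listen port substituted for the DB port:

DATABASE_URL=postgres://<user>:<pass>@localhost:5435/<db> go run .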

Once it's been running for a bit, add a timeout toxic like this:

toxiproxy-cli toxic add --type timeout --attribute timeout=3 --toxicName to pg

Let the test case continue running, and watch it eventually grind to a halt. You can remove the toxic like this:

toxiproxy-cli toxic remove --toxicName to pg   

... but the app is now dead and won't recover from this state.

@benweint (Contributor, Author)

The same test case with v4 works fine, in the sense that it recovers once the toxic is removed.

@benweint (Contributor, Author)

I'm actually able to reproduce this even more easily (without the need for Toxiproxy) by just running the test case above with the Go race detector turned on. Even without messing with the DB connection, it reliably hangs after a few seconds in this mode on my machine.
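
(That is, the same repro run under the race detector:)

DATABASE_URL=postgres://<user>:<pass>@localhost:<port>/<db> go run -race .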

@jackc (Owner) commented Oct 28, 2022

@benweint Can you try this using puddle v2.1.0? I think it resolves the issue.
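
For anyone wanting to test this ahead of a pgx release, bumping the transitive dependency directly in the application's module should be enough; Go's minimal version selection will keep the newer puddle version even though pgx itself still declares an older one:

go get github.com/jackc/puddle/v2@v2.1.0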

@benweint (Contributor, Author)

Thanks @jackc - it looks like puddle 2.1.0 does indeed make this test case work!

We had actually tried out the fix in jackc/puddle#24 in a production-like environment (mirrored production traffic) and saw some bad connection-related behavior with that change, but it's possible that that was misdiagnosed or caused by something else at the application layer.

We should be able to try again with puddle v2.1.0 in that same environment next week to confirm whether what we saw recurs or if it's resolved.

(cc @jorgerasillo, who will probably be the one to try this next week)

@jon-whit commented Nov 3, 2022

We're encountering this issue as well 👍

@jorgerasillo

Hi @jackc,
we've been running with puddle v2.1.0 for a little over 24 hours and have not seen this issue occur. We actually triggered a series of failovers and the application was able to recover successfully.

Thanks for the quick turnaround!

@jon-whit commented Nov 3, 2022

@jackc any reason not to issue a new release with puddle v2.1.0?

@jackc (Owner) commented Nov 3, 2022

any reason not to issue a new release with puddle v2.1.0?

puddle v2.1.0 requires Go 1.19 and pgx needs to support Go 1.18 until Go 1.20 is released.

@hochhaus commented Nov 4, 2022

any reason not to issue a new release with puddle v2.1.0?

puddle v2.1.0 requires Go 1.19 and pgx needs to support Go 1.18 until Go 1.20 is released.

Thanks for your work @jackc.

I understand the rationale, but this is unfortunate. Should a clear note be added to the README telling users of v5 pgxpool that they either need to use puddle v2.1.0 or stick with v4 until Go 1.20 is released?

jackc added a commit to jackc/puddle that referenced this issue Nov 12, 2022
This reverts commit 86ffb01.

Given jackc/pgx#1354 the temporary dependency
on go.uber.org/atomic for Go 1.18 support is worthwhile.
@jackc (Owner) commented Nov 12, 2022

I've reverted the change in puddle that requires Go 1.19, at the expense of adding a dependency on go.uber.org/atomic. I had hoped to avoid that, but given that v2.1.0+ has an important fix, it's worth allowing a dependency for now.

pgx v5.1.0 now depends on the puddle version with the fix.
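
For context on the trade-off: the Go 1.19 floor presumably came from the standard library's typed atomics added in that release, which go.uber.org/atomic mirrors on older Go versions. A rough illustration of the kind of swap involved (not puddle's actual code):

package counter

import (
	uberatomic "go.uber.org/atomic" // builds on Go 1.18; mirrors sync/atomic.Int64 from Go 1.19+
)

// count is safe for concurrent use; its zero value is ready to use, just like
// sync/atomic.Int64 on Go 1.19+.
var count uberatomic.Int64

// Record increments the counter and returns the new value.
func Record() int64 {
	return count.Add(1)
}

// Current returns the current counter value.
func Current() int64 {
	return count.Load()
}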

@jackc jackc closed this as completed Nov 12, 2022