
Connection pool deadlocks with v5 (regression from v4) #1354

Closed · benweint opened this issue Oct 26, 2022 · 11 comments

@benweint (Contributor)

Describe the bug
We're seeing an issue wherein we can consistently reproduce application hangs with pgx/v5 that do not occur with pgx/v4. The only changes to the application itself between the pgx/v4-based version that works and the pgx/v5 version that doesn't are those necessitated by the API differences between the two versions (namely, pgxpool.ConnectConfig vs. pgxpool.NewWithConfig).
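
For reference, the construction-level difference between the two builds is roughly the following (a sketch of the migration change, not the application's actual code):

package main

import (
	"context"
	"os"

	"github.com/jackc/pgx/v5/pgxpool"
)

// newPool constructs the pool the v5 way; under v4 the same parsed config
// was passed to pgxpool.ConnectConfig(ctx, cfg) instead.
func newPool(ctx context.Context, dsn string) (*pgxpool.Pool, error) {
	cfg, err := pgxpool.ParseConfig(dsn)
	if err != nil {
		return nil, err
	}
	return pgxpool.NewWithConfig(ctx, cfg) // v4: pgxpool.ConnectConfig(ctx, cfg)
}

func main() {
	pool, err := newPool(context.Background(), os.Getenv("DATABASE_URL"))
	if err != nil {
		panic(err)
	}
	defer pool.Close()
}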

To Reproduce

I don't have a compact and easily shareable reproduction case yet, but here's the outline of what we're doing:

  1. Deploy an application that uses pgx/v5 to read from a Postgres 14 database. In our case, the DB happens to be running on GCP's CloudSQL service, and the DB in question is configured using CloudSQL's high-availability feature.
  2. Confirm that the application is working as expected (able to read from the DB). In our case, all of the application's reads happen via the QueryRow method.
  3. Initiate a CloudSQL database failover as described here.
  4. Attempt to exercise the application again, and observe that DB calls all hang under the following stack trace:
goroutine 43048 [semacquire, 2 minutes]:
sync.runtime_SemacquireMutex(0x101f64a68?, 0x20?, 0x14000e38c08?)
	/Users/ben/.asdf/installs/golang/1.19/go/src/runtime/sema.go:77 +0x24
sync.(*Mutex).lockSlow(0x14000464618)
	/Users/ben/.asdf/installs/golang/1.19/go/src/sync/mutex.go:171 +0x178
sync.(*Mutex).Lock(0x14000e38d08?)
	/Users/ben/.asdf/installs/golang/1.19/go/src/sync/mutex.go:90 +0x7c
github.com/jackc/puddle/v2.(*Pool[...]).Acquire(0x140001e08c0, {0x10135cf28, 0x14002b31320})
	/Users/ben/go/pkg/mod/github.com/jackc/puddle/v2@v2.0.0/pool.go:413 +0x2a4
github.com/jackc/pgx/v5/pgxpool.(*Pool).Acquire(0x1400018c3f0, {0x10135cf28, 0x14002b31320})
	/Users/ben/go/pkg/mod/github.com/jackc/pgx/v5@v5.0.2/pgxpool/pool.go:496 +0x50
github.com/jackc/pgx/v5/pgxpool.(*Pool).QueryRow(0x1400115c48d?, {0x10135cf28, 0x14002b31320}, {0x100fc7b14, 0x4d}, {0x140019994e0, 0x1, 0x1})
	/Users/ben/go/pkg/mod/github.com/jackc/pgx/v5@v5.0.2/pgxpool/pool.go:623 +0x38
< ... snip - app code below here ... >

From what I can tell, all of the request handling goroutines are blocked under this same stack, and there don't seem to be any that are actually holding DB connections.

I've also been able to reproduce this running entirely locally by interposing Toxiproxy in between the application and the DB, and adding a timeout toxic to the proxied connection. Even after this toxic is removed, the application will not recover once it's gotten into this state.

Interestingly, the application does recover if the duration of the toxic being in place is short enough (a few seconds). Even with a high rate of incoming requests (500 reqs/s with a concurrency of 500), it seems to take anywhere from 10 - 60 seconds for it to get stuck like this, suggesting a timing-dependent issue of some kind.

I'm hoping to be able to winnow this down to an easily-shareable test case soon.

Expected behavior

I would expect the application to be impacted while the DB is unreachable or unresponsive, but the pool should become usable again once the DB recovers.

Actual behavior

If the DB unreachability goes on for long enough, the pool will eventually enter a state wherein any goroutine that attempts to Acquire a connection hangs indefinitely.
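
One way to see the stuck state from outside the blocked request handlers is to poll the pool's counters. Below is a minimal sketch using pgxpool's Stat API; the watcher itself is illustrative and not part of the affected application:

package poolwatch

import (
	"context"
	"log"
	"time"

	"github.com/jackc/pgx/v5/pgxpool"
)

// WatchPool logs the pool's counters every few seconds. In the stuck state
// described above, callers remain blocked in Acquire even though no goroutine
// appears to hold a connection, so these numbers stop changing.
func WatchPool(ctx context.Context, pool *pgxpool.Pool) {
	t := time.NewTicker(5 * time.Second)
	defer t.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-t.C:
			s := pool.Stat()
			log.Printf("pool: total=%d idle=%d acquired=%d constructing=%d",
				s.TotalConns(), s.IdleConns(), s.AcquiredConns(), s.ConstructingConns())
		}
	}
}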

Version

  • Go: go version go1.19 darwin/arm64
  • PostgreSQL: PostgreSQL 14.5 (Debian 14.5-1.pgdg110+1) on aarch64-unknown-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit
  • pgx: github.com/jackc/pgx/v5 v5.0.2
@benweint benweint added the bug label Oct 26, 2022
@benweint (Contributor, Author)

Ok, here's a test case:

package main

import (
	"context"
	"fmt"
	"github.com/jackc/pgx/v5/pgxpool"
	"os"
	"sync"
	"time"
)

func main() {
	// urlExample := "postgres://username:password@localhost:5432/database_name"
	ctx := context.Background()

	cfg, err := pgxpool.ParseConfig(os.Getenv("DATABASE_URL"))
	if err != nil {
		panic(err)
	}

	cfg.MinConns = 100
	cfg.MaxConns = 100

	pool, err := pgxpool.NewWithConfig(context.Background(), cfg)
	if err != nil {
		panic(err)
	}
	defer pool.Close()

	concurrency := 100
	var wg sync.WaitGroup

	for i := 0; i < concurrency; i++ {
		wg.Add(1)
		go runQueries(ctx, fmt.Sprintf("%d", i), pool)
	}

	wg.Wait()
}

func runQueries(ctx context.Context, id string, pool *pgxpool.Pool) {
	for {
		runOnce(ctx, id, pool)
	}
}

func runOnce(ctx context.Context, id string, pool *pgxpool.Pool) {
	ctx, cancel := context.WithTimeout(ctx, 1*time.Second)
	defer cancel()

	var one int
	err := pool.QueryRow(ctx, "SELECT 1").Scan(&one)
	if err != nil {
		fmt.Printf("got error in %s: %v\n", id, err)
		return
	}
	fmt.Printf("ok %s!\n", id)
}

Run with:

DATABASE_URL=postgres://<user>:<pass>@localhost:<port>/<db> go run .

If connected directly to a DB, this works fine. To reproduce this issue, the easiest thing I've found so far is to use Toxiproxy:

Start the server with toxiproxy-server. Create a proxy listening on some port, pointing to your local Postgres:

toxiproxy-cli create --listen localhost:5435 --upstream localhost:5432 pg

Run the test case again, pointed at the proxy listening port (5435 in this example) instead of directly at the DB. Verify that it works (you should see lots of ok! messages).
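
Concretely, that's the same command with the proxy's listen port substituted for the DB port:

DATABASE_URL=postgres://<user>:<pass>@localhost:5435/<db> go run .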

Once it's been running for a bit, add a timeout toxic like this:

toxiproxy-cli toxic add --type timeout --attribute timeout=3 --toxicName to pg

Let the test case continue running, and watch it eventually grind to a halt. You can remove the toxic like this:

toxiproxy-cli toxic remove --toxicName to pg   

... but the app is now dead and won't recover from this state.

@benweint (Contributor, Author)

The same test case with v4 works fine, in the sense that it recovers once the toxic is removed.

@benweint (Contributor, Author)

I'm actually able to reproduce this even more easily (without the need for Toxiproxy) by just running the test case above with the Go race detector turned on. Even without messing with the DB connection, it reliably hangs after a few seconds in this mode on my machine.
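
(That is, the same repro run under the race detector:)

DATABASE_URL=postgres://<user>:<pass>@localhost:<port>/<db> go run -race .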

@jackc (Owner) commented Oct 28, 2022

@benweint Can you try this using puddle v2.1.0? I think it resolves the issue.
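
For anyone wanting to test this ahead of a pgx release, bumping the transitive dependency directly in the application's module should be enough; Go's minimal version selection will keep the newer puddle version even though pgx itself still declares an older one:

go get github.com/jackc/puddle/v2@v2.1.0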

@benweint (Contributor, Author)

Thanks @jackc - it looks like puddle 2.1.0 does indeed make this test case work!

We had actually tried out the fix in jackc/puddle#24 in a production-like environment (mirrored production traffic) and saw some bad connection-related behavior with that change, but it's possible that that was misdiagnosed or caused by something else at the application layer.

We should be able to try again with puddle v2.1.0 in that same environment next week to confirm whether what we saw recurs or if it's resolved.

(cc @jorgerasillo, who will probably be the one to try this next week)

@jon-whit commented Nov 3, 2022

We're encountering this issue as well 👍

@jorgerasillo

Hi @jackc,
we've been running with puddle v2.1.0 for a little over 24 hours and have not seen this issue occur. We actually triggered a series of failovers and the application was able to recover successfully.

Thanks for the quick turnaround!

@jon-whit commented Nov 3, 2022

@jackc any reason not to issue a new release with puddle v2.1.0?

@jackc (Owner) commented Nov 3, 2022

any reason not to issue a new release with puddle v2.1.0?

puddle v2.1.0 requires Go 1.19 and pgx needs to support Go 1.18 until Go 1.20 is released.

@hochhaus commented Nov 4, 2022

any reason not to issue a new release with puddle v2.1.0?

puddle v2.1.0 requires Go 1.19 and pgx needs to support Go 1.18 until Go 1.20 is released.

Thanks for your work @jackc.

I understand the rationale, but this is unfortunate. Should a clear note be added to the README telling users of v5 pgxpool that they either need to use puddle v2.1.0 or stick with v4 until Go 1.20 is released?

jackc added a commit to jackc/puddle that referenced this issue Nov 12, 2022
This reverts commit 86ffb01.

Given jackc/pgx#1354 the temporary dependency
on go.uber.org/atomic for Go 1.18 support is worthwhile.
@jackc (Owner) commented Nov 12, 2022

I've reverted the change in puddle that requires Go 1.19, at the expense of adding a dependency on go.uber.org/atomic. I had hoped to avoid that, but given that v2.1.0+ has an important fix, it's worth allowing a dependency for now.

pgx v5.1.0 now depends on the puddle version with the fix.
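
For context on the trade-off: the Go 1.19 floor presumably came from the standard library's typed atomics added in that release, which go.uber.org/atomic mirrors on older Go versions. A rough illustration of the kind of swap involved (not puddle's actual code):

package counter

import (
	uberatomic "go.uber.org/atomic" // builds on Go 1.18; mirrors sync/atomic.Int64 from Go 1.19+
)

// count is safe for concurrent use; its zero value is ready to use, just like
// sync/atomic.Int64 on Go 1.19+.
var count uberatomic.Int64

// Record increments the counter and returns the new value.
func Record() int64 {
	return count.Add(1)
}

// Current returns the current counter value.
func Current() int64 {
	return count.Load()
}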

@jackc jackc closed this as completed Nov 12, 2022