Connection pool deadlocks with v5 (regression from v4) #1354
Comments
Ok, here's a test case:

```go
package main

import (
	"context"
	"fmt"
	"os"
	"sync"
	"time"

	"github.com/jackc/pgx/v5/pgxpool"
)

func main() {
	// urlExample := "postgres://username:password@localhost:5432/database_name"
	ctx := context.Background()
	cfg, err := pgxpool.ParseConfig(os.Getenv("DATABASE_URL"))
	if err != nil {
		panic(err)
	}
	cfg.MinConns = 100
	cfg.MaxConns = 100

	pool, err := pgxpool.NewWithConfig(context.Background(), cfg)
	if err != nil {
		panic(err)
	}
	defer pool.Close()

	concurrency := 100
	var wg sync.WaitGroup
	for i := 0; i < concurrency; i++ {
		wg.Add(1)
		go runQueries(ctx, fmt.Sprintf("%d", i), pool)
	}
	wg.Wait()
}

func runQueries(ctx context.Context, id string, pool *pgxpool.Pool) {
	for {
		runOnce(ctx, id, pool)
	}
}

func runOnce(ctx context.Context, id string, pool *pgxpool.Pool) {
	ctx, cancel := context.WithTimeout(ctx, 1*time.Second)
	defer cancel()

	var one int
	err := pool.QueryRow(ctx, "SELECT 1").Scan(&one)
	if err != nil {
		fmt.Printf("got error in %s: %v\n", id, err)
		return
	}
	fmt.Printf("ok %s!\n", id)
}
```

Run with:
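The exact run command isn't preserved in this thread. Assuming the test case is saved as main.go and `DATABASE_URL` points at the database (or, later, at the proxy), it would be along the lines of:

```sh
DATABASE_URL="postgres://username:password@localhost:5432/database_name" go run main.go
```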
If connected directly to a DB, it works fine. To reproduce this issue, the easiest thing I've found so far is to use Toxiproxy: Start the server with
Run the test case again, pointed at the proxy listening port (5435 in this example) instead of directly at the DB. Verify that it works (you should see lots of `ok <id>!` output). Once it's been running for a bit, add a timeout toxic like this:
Let the test case continue running, and watch it eventually grind to a halt. You can remove the toxic like this:
... but the app is now dead and won't recover from this state. (A sketch of the Toxiproxy commands for these steps follows this comment.) |
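The specific Toxiproxy commands aren't preserved in this thread. A hypothetical sequence, assuming a proxy named postgres listening on 5435 in front of a local database on 5432 (proxy name, ports, and flag spellings are assumptions and may differ between Toxiproxy versions), would look roughly like:

```sh
# Start the Toxiproxy server (assumed default settings)
toxiproxy-server &

# Create a proxy on 5435 that forwards to the real database on 5432
toxiproxy-cli create -l localhost:5435 -u localhost:5432 postgres

# Add a timeout toxic that stalls traffic on the proxied connections
toxiproxy-cli toxic add -t timeout -a timeout=0 postgres

# Remove the toxic again once the app has ground to a halt
toxiproxy-cli toxic remove -n timeout_downstream postgres
```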
Same test case with v4 works fine, in the sense that when the toxic is removed, it recovers. |
I'm actually able to get this to reproduce even more easily (without the need for Toxiproxy) by just running the test case above with the Go race detector turned on. Even without messing with the DB connection, it reliably hangs after a few seconds in this mode on my machine. |
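For reference, the race-detector run mentioned here is the standard `-race` flag; assuming the test case is saved as main.go:

```sh
DATABASE_URL="postgres://username:password@localhost:5432/database_name" go run -race main.go
```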
@benweint Can you try this using puddle v2.1.0?
Thanks @jackc - it looks like puddle 2.1.0 does indeed make this test case work! We had actually tried out the fix in jackc/puddle#24 in a production-like environment (mirrored production traffic) and saw some bad connection-related behavior with that change, but it's possible that that was misdiagnosed or caused by something else at the application layer. We should be able to try again with puddle v2.1.0 in that same environment next week to confirm whether what we saw recurs or if it's resolved. (cc @jorgerasillo, who will probably be the one to try this next week) |
We're encountering this issue as well 👍 |
Hi @jackc, thanks for the quick turnaround!
@jackc any reason not to issue a new release with puddle v2.1.0?
puddle v2.1.0 requires Go 1.19, and I'd rather not have pgx v5 require Go 1.19 just yet.
Thanks for your work @jackc. I understand the rationale but this is unfortunate. Should a clear note be added to the README for users of v5 pgxpool that they either need to require puddle v2.1.0 themselves or risk hitting this deadlock?
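For anyone who wants the fix ahead of a new pgx release, the workaround discussed here amounts to requiring the fixed puddle version directly in the application's module. A sketch (per the discussion above, this puddle version needs Go 1.19):

```sh
# Bump only the puddle dependency; pgx picks up the newer minor version.
go get github.com/jackc/puddle/v2@v2.1.0
go mod tidy
```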
This reverts commit 86ffb01. Given jackc/pgx#1354 the temporary dependency on go.uber.org/atomic for Go 1.18 support is worthwhile.
I've reverted the change in puddle that requires Go 1.19 at the expense of adding a dependency to go.uber.org/atomic. I had hoped to avoid that, but given v2.1.0+ has an important fix it's worth allowing a dependency for now. pgx v5.1.0 now depends on the puddle version with the fix. |
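With that release, upgrading pgx itself pulls in the fixed puddle transitively; a minimal sketch:

```sh
go get github.com/jackc/pgx/v5@v5.1.0
go mod tidy
```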
Describe the bug
We're seeing an issue wherein we can consistently reproduce application hangs with `pgx/v5` that do not occur with `pgx/v4`. The only changes to the application itself between the pgx/v4-based version that works and the pgx/v5 version that doesn't are those necessitated by the API differences between the two versions (namely, `pgxpool.ConnectConfig` vs. `pgxpool.NewWithConfig`).
To Reproduce
I don't have a compact and easily shareable reproduction case yet, but here's the outline of what we're doing:
- The application uses `pgx/v5` to read from a Postgres 14 database. In our case, the DB happens to be running on GCP's CloudSQL service, and the DB in question is configured using CloudSQL's high-availability feature.
- Queries are issued via the pool's `QueryRow` method.
From what I can tell, all of the request handling goroutines are blocked under this same stack, and there don't seem to be any that are actually holding DB connections.
I've also been able to reproduce this running entirely locally by interposing Toxiproxy in between the application and the DB, and adding a timeout toxic to the proxied connection. Even after this toxic is removed, the application will not recover once it's gotten into this state.
Interestingly, the application does recover if the duration of the toxic being in place is short enough (a few seconds). Even with a high rate of incoming requests (500 reqs/s with a concurrency of 500), it seems to take anywhere from 10 - 60 seconds for it to get stuck like this, suggesting a timing-dependent issue of some kind.
I'm hoping to be able to winnow this down to an easily-shareable test case soon.
Expected behavior
I would expect the application to be impacted while the DB is unreachable or unresponsive, but the pool should become usable again once the DB recovers.
Actual behavior
If the DB unreachability goes on for long enough, the pool will eventually enter a state wherein any goroutine that attempts to Acquire a connection hangs indefinitely.
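To make the reported symptom concrete, here is a minimal sketch (not from the original report; the pool construction and the 2-second deadline are assumptions) of how a caller can probe a pool in this state. Without a deadline, `Acquire` blocks forever once the pool is wedged; under a deadline, the caller at least gets control back with an error.

```go
package main

import (
	"context"
	"fmt"
	"os"
	"time"

	"github.com/jackc/pgx/v5/pgxpool"
)

func main() {
	// Assumed setup: a pool pointed at the (possibly unreachable) database.
	pool, err := pgxpool.New(context.Background(), os.Getenv("DATABASE_URL"))
	if err != nil {
		panic(err)
	}
	defer pool.Close()

	// Probe the pool under a deadline. On a healthy pool this returns quickly;
	// on a wedged pool it fails once the deadline expires instead of hanging.
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	conn, err := pool.Acquire(ctx)
	if err != nil {
		fmt.Printf("acquire failed (pool may be deadlocked): %v\n", err)
		return
	}
	defer conn.Release()
	fmt.Println("acquired a connection; pool is healthy")
}
```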
Version
go version go1.19 darwin/arm64
PostgreSQL 14.5 (Debian 14.5-1.pgdg110+1) on aarch64-unknown-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit