Connection leak? #1777
Can you identify exactly which version introduced the issue? That would help us narrow down possible breaking changes.
Not really, as once the condition hits we return a lot of errors on our APIs, which are used by customers. I'm trying to replicate this in our staging environment or in specific integration tests, but haven't been lucky so far.
Well, we'd need to know which version broke it, unless there is something wrong in your code. For example, sometimes developers use an obsolete approach to using connections, and once support for it is removed, upgrading the library may result in bugs like that. Also, it is a little confusing that you list several versions of
Yes, those are only sub-dependencies, which have been updated from version x ~> version y. I listed them here as each of them could have potentially introduced this.
Absolutely, but afaik we only use https://github.com/vitaly-t/pg-promise#transactions with async/await or
Misspelled - I meant
That's a good start 😄 I hope you will find from which version your code stopped working 😉 Maybe you've got a leak somewhere, like a module that after upgrading no longer releases connections correctly.
I think I may be seeing something similar:

```
Error: timeout exceeded when trying to connect
    at Timeout.setTimeout [as _onTimeout] (/.../node_modules/pg-pool/index.js:178:27)
    at ontimeout (timers.js:478:11)
    at tryOnTimeout (timers.js:302:5)
    at Timer.listOnTimeout (timers.js:262:5)
```

if I have:

cron.js

```javascript
const { CronJob } = require('cron')
const { someJob } = require('./jobs')

new CronJob('*/30 * * * * *', someJob, null, true)
```

jobs.js

```javascript
const { Pool } = require('pg')

// Passing `connectionTimeoutMillis: 5000` here makes it error out instead of hanging
const pool = new Pool()

const someJob = async () => {
  // This is the line that hangs or errors on the second execution
  let { rows } = await pool.query('...', [ ... ])
  for (let row of rows) {
    let client = await pool.connect()
    try {
      await client.query('...')
      await client.query('COMMIT')
    } catch (err) {
      await client.query('ROLLBACK')
    }
  }
}

module.exports = { someJob }
```

The above example is obviously contrived, but it captures my underlying usage. I can try to throw together a reproducible repo, but I'm not sure what DB it would hit, since I obviously can't expose the service I'm working on. Aside from this cron job, pools have worked fine for me throughout the rest of the application I'm building (a koa-based API). I have read through the docs to see if I'm missing some cleanup steps or something, and tried a few things like
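For reference, the standard node-postgres checkout pattern releases the client in a `finally` block so it always returns to the pool, even when a query throws. Below is a minimal runnable sketch of that pattern; `FakePool` is a hypothetical stand-in for pg's `Pool` (so no database is needed), but the `connect()`/`release()` shape mirrors pg's API:

```javascript
// FakePool is a hypothetical stand-in for pg's Pool so this runs without a DB.
class FakePool {
  constructor(max = 10) {
    this.max = max
    this.inUse = 0 // clients currently checked out
  }
  async connect() {
    if (this.inUse >= this.max) throw new Error('pool exhausted')
    this.inUse++
    return {
      query: async (sql) => ({ rows: [] }), // stubbed query
      release: () => { this.inUse-- },      // hands the client back to the pool
    }
  }
}

const pool = new FakePool()

const someJob = async (rows) => {
  // for...of (not forEach) so each iteration can await, and the
  // client is released in `finally` even when a query throws.
  for (const row of rows) {
    const client = await pool.connect()
    try {
      await client.query('BEGIN')
      await client.query('...')
      await client.query('COMMIT')
    } catch (err) {
      await client.query('ROLLBACK')
    } finally {
      client.release() // without this line, every iteration leaks a client
    }
  }
}
```

With the `finally` release in place, the number of checked-out clients returns to zero after the job, no matter how many rows were processed.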
The issue (for me at least) was that I wasn't calling `client.release()`. I noticed this after seeing the default error:

```
Error: You've exceeded the maximum number of clients in your pool. Make sure to call `client.release()` to release clients back into the pool and bump `max` to more than 10 if you actually need 10+ clients in the pool at once.
```
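To make this failure mode concrete, here is an illustrative simulation (a sketch, not pg-pool itself; `BoundedPool` is hypothetical) of how skipping `release()` fills a bounded pool until the next checkout fails, which is what surfaces as the timeout/max-clients errors above:

```javascript
// BoundedPool is a hypothetical stand-in for a pg-style pool with `max` slots.
class BoundedPool {
  constructor(max) {
    this.max = max
    this.inUse = 0
  }
  async connect() {
    if (this.inUse >= this.max) {
      // pg-pool would instead queue the caller until connectionTimeoutMillis
      // expires; we fail immediately for clarity.
      throw new Error('timeout exceeded when trying to connect')
    }
    this.inUse++
    return { release: () => { this.inUse-- } }
  }
}

;(async () => {
  const pool = new BoundedPool(10)

  // A leaky job: clients are checked out but never released.
  for (let i = 0; i < 10; i++) {
    await pool.connect()
  }

  // The 11th checkout finds the pool exhausted.
  try {
    await pool.connect()
  } catch (err) {
    console.log(err.message) // → timeout exceeded when trying to connect
  }
})()
```

Because the leaked clients are never handed back, the pool never recovers; every later caller waits (or errors) the same way, which matches the persistent-until-restart behavior described in this thread.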
@johanneswuerbach yes, we might have a similar missing release somewhere, but sadly I haven't found anything so far. Especially as usually (and before the upgrade) everything just works ™️, but at some point something is killing the pool. Below is an example of a process generally serving ~100-200 req/min until the pool is gone and it serves only 500s caused by
Found two potential candidates: brianc/node-pg-pool#109 🎉
@johanneswuerbach nice. Yeah, so I think we were running into the same issue (the pool gets filled and then timeouts are thrown) but with different root causes. In my case, I was missing the `client.release()` call.
We have been running with brianc/node-pg-pool#109 in production for some time now and it looks like this fully resolved our issues.
We are running the latest versions of all libraries now and no longer see this issue.
We recently updated some of our Postgres client libraries and experienced an almost total pool exhaustion roughly 2-3h after deploying the new version in production (never happened before), on 2 out of 20 DBs.
We updated the following libraries:

- pg 7.4.3 ~> 7.6.1
- pg-pool 2.0.3 ~> 2.0.4
- pg-promise 7.5.4 ~> 8.5.2
As the pool is slowly exhausted, a majority of queries hitting an affected process fail with "timeout exceeded when trying to connect" (https://github.com/brianc/node-pg-pool/blob/v2.0.4/index.js#L178). The percentage increases over time and the condition is persistent (~30min until revert).
I've been debugging this for multiple days now, but I'm having a hard time identifying the exact root cause. Suspected so far:
- brianc/node-pg-pool#86, which means we generally queue more work now, as pending queue items are no longer all dropped; but we rarely saw "timeout exceeded when trying to connect" errors before, so this seems unlikely
- #1503: some kind of race condition here, as both affected DBs are occasionally hit by queries running into statement timeouts
Any ideas, pointers, or potential areas for races would be really appreciated.
//cc @vitaly-t I know that pg-promise is not part of the pg distribution, but you seemed really active here and I would prefer a single spot for discussion. Any insight would be appreciated.

We mainly (but not exclusively) use nested transactions via https://github.com/vitaly-t/pg-promise#transactions, starting with:

```sql
SET LOCAL statement_timeout = 30000;
```
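For context, `SET LOCAL` scopes the setting to the enclosing transaction only, so the pattern above behaves roughly like this sketch (illustrative SQL, not the actual production code):

```sql
BEGIN;
SET LOCAL statement_timeout = 30000;  -- 30s cap, applies only inside this transaction
-- application queries run here; any statement exceeding 30s is cancelled
COMMIT;  -- the session's previous statement_timeout is restored afterwards
```

A statement cancelled by this timeout raises an error mid-transaction, which is exactly the moment a missed release or an error-path bug in the client library would leave a connection stranded.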