fix: treat io.ErrUnexpectedEOF as driver.ErrBadConn to prevent connection pool poisoning#1299
Conversation
|
Can't this use the existing pqtest.Fake to test this error? Over 250 lines of code for a single test is quite a lot to maintain going forward, and it's not immediately obvious what it does exactly either. I'm also okay with just running this against a real database. I made #1304 to run it in the CI, although I need to finish up some bits. Is changing errQueryInProgress intentional or a left-over from some previous debugging? |
fa858df to
e2a6281
Compare
|
@arp242 Oops yeah that was left behind, thanks. Pushed an update that uses pqtest instead which reduces it a lot. If you'd prefer not having a test for this case entirely, I can just remove it. It was mostly to provide a in-repo example of the demo I linked to that reproduces the case in full. The check in |
e2a6281 to
205ea6b
Compare
When recvMessage does an io.ReadFull on a partially-received message body and the connection drops mid-read, the result is io.ErrUnexpectedEOF. handleError classifies io.EOF as driver.ErrBadConn but not io.ErrUnexpectedEOF, so cn.err is never set, IsValid() returns true, and database/sql keeps recycling the broken connection. The inProgress flag stays stuck at true (ReadyForQuery never arrived), and the CAS guard rejects every subsequent query with "there is already a query being processed on this connection" — permanently poisoning the pool. So treat io.ErrUnexpectedEOF the same as io.EOF in handleError: both indicate a dead connection. Fixes lib#1298
205ea6b to
6cc34d3
Compare
|
Thanks! |
Fixes #1298
When a TCP read is interrupted mid-message (partial DataRow), pq's
recvMessagereturnsio.ErrUnexpectedEOF. Prior to this fix,handleErrordid not classify this asdriver.ErrBadConn, so:cn.errwas never set, andIsValid()returned trueinProgressatomic flag remained stuck at true (since ReadyForQuerynever arrived)
database/sqlkept handing out the broken connection"there is already a query being processed on this connection"
This is especially impactful with CockroachDB, where node drains and rebalances routinely produce partial reads.
Two changes:
error.go: Treatio.ErrUnexpectedEOFthe same asio.EOFinhandleError- both indicate a dead connection.conn.go: WraperrQueryInProgresswithdriver.ErrBadConnsodatabase/sqlevicts the connection instead of recycling it. This acts as defense for any future code path that leavesinProgressstuck.Includes unit test for
handleErrorand integration test using a TCP fault-injection proxy that truncates a DataRow mid-body to reproduce the exact failure mode.