NBD server losing connection to PostgreSQL #154

crabique · 2023-04-10T16:49:19Z

Describe the bug

When using benji nbd with configuration pointing to a PostgreSQL pool, the NBD server stops initiating new client connections after some period of inactivity.

The following errors can be found in the process output:

   INFO: Incoming connection from <IP>:60948.
   DEBUG: Client requested NBD_FLAG_NO_ZEROES.
   DEBUG: [<IP>:60948]: opt=7, length=34, data=b'\x00\x00\x00\x1c<UID>\x00\x00'
   DEBUG: [<IP>:60948]: opt=1, length=28, data=b'<UID>'
   ERROR: Task exception was never retrieved
future: <Task finished coro=<NbdServer.handler() done, defined at /benji/lib64/python3.6/site-packages/benji/nbdserver.py:176> exception=OperationalError('(psycopg2.OperationalError) server closed the connection unexpectedly\n\tThis probably means the server terminated abnormally\n\tbefore or while processing the request.\n',)>

After that, all new connection requests cause the server to show something like that:

ERROR: [<IP>:44590] NBD_CMD_READ: Can't reconnect until invalid transaction is rolled back. (Background on this error at: https://sqlalche.me/e/14/8s2b)

Traceback (most recent call last):

  File "/benji/lib64/python3.6/site-packages/benji/nbdserver.py", line 346, in handler
    data = self.store.read(version, cow_version, offset, length)
  File "/benji/lib64/python3.6/site-packages/benji/benji.py", line 1340, in read
    read_list = self._block_list(version, offset, length)
  File "/benji/lib64/python3.6/site-packages/benji/benji.py", line 1311, in _block_list
    block = version.get_block_by_idx(block_idx)
  File "/benji/lib64/python3.6/site-packages/benji/database.py", line 586, in get_block_by_idx
    block = Session.query(Block).filter(Block.version_id == self.id, Block.idx == idx).one_or_none()
  File "/benji/lib64/python3.6/site-packages/sqlalchemy/orm/query.py", line 2849, in one_or_none
    return self._iter().one_or_none()
  File "/benji/lib64/python3.6/site-packages/sqlalchemy/orm/query.py", line 2910, in _iter
    execution_options={"_sa_orm_load_options": self.load_options},
  File "/benji/lib64/python3.6/site-packages/sqlalchemy/orm/session.py", line 1712, in execute
    result = conn._execute_20(statement, params or {}, execution_options)
  File "/benji/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 1705, in _execute_20
    return meth(self, args_10style, kwargs_10style, execution_options)
  File "/benji/lib64/python3.6/site-packages/sqlalchemy/sql/elements.py", line 334, in _execute_on_connection
    self, multiparams, params, execution_options
  File "/benji/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 1582, in _execute_clauseelement
    cache_hit=cache_hit,
  File "/benji/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 1798, in _execute_context
    conn = self._revalidate_connection()
  File "/benji/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 645, in _revalidate_connection
    self._invalid_transaction()
  File "/benji/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 625, in _invalid_transaction
    code="8s2b",
sqlalchemy.exc.PendingRollbackError: Can't reconnect until invalid transaction is rolled back. (Background on this error at: https://sqlalche.me/e/14/8s2b)

This behaviour persists until the NBD server is restarted.

To Reproduce

Configure the following databaseEngine:

databaseEngine: postgresql://<PG_USER>:<PG_PASS>@<PG_HOST>/<BENJI_DB>

Try to connect to the NBD server using any client, then close the connection or keep it idle for 10-15 minutes (so that the server doesn't get any new requests), then try to connect again.

Expected behavior

NBD server works reliably without restarting it after every connection drop.

Platform and versions (please complete the following information):

benji v0.17.0
bitnami/[email protected] helm chart for the database

Additional context

Based on the suggestions here, I managed to make it self-recover in such cases by hotpatching this line in site-packages:

self._engine = sqlalchemy.create_engine(url, pool_pre_ping=True)

Still, this sometimes requires the client to initiate the connection 2+ times for the server to recover and actually accept it, but this more or less fixes the problem.

Please let me know if I you need any more details or if I can help in any way.

The text was updated successfully, but these errors were encountered:

elemental-lf · 2023-04-25T09:45:47Z

Thanks for the report! What error message(s) or warnings are you getting with your patch installed? Adding pool_pre_ping=True shouldn't hurt either way but we should try to figure out what we need to do to make the re-connection process completely seamless from the user's POV.

crabique · 2023-04-25T12:11:31Z

Session where it fails and then recovers even with pre-ping enabled:

    INFO: [<IP>:32832] disconnecting
   ERROR: Task exception was never retrieved
future: <Task finished coro=<NbdServer.handler() done, defined at /benji/lib64/python3.6/site-packages/benji/nbdserver.py:176> exception=OperationalError('(psycopg2.OperationalError) server closed the connection unexpectedly\n\tThis probably means the server terminated abnormally\n\tbefore or while processing the request.\n',)>
    INFO: Incoming connection from <IP>:32834.
   DEBUG: Client requested NBD_FLAG_NO_ZEROES.
   DEBUG: [<IP>:32834]: opt=7, length=34, data=b'\x00\x00\x00\x1c<UID>\x00\x00'
   DEBUG: [<IP>:32834]: opt=1, length=28, data=b'<UID>'
    INFO: [<IP>:32834] Negotiated export: <UID>.
   ERROR: Task exception was never retrieved
future: <Task finished coro=<NbdServer.handler() done, defined at /benji/lib64/python3.6/site-packages/benji/nbdserver.py:176> exception=InternalError('Attempt to acquire lock Version <UID> twice.',)>
    INFO: Incoming connection from <IP>:32836.
   DEBUG: Client requested NBD_FLAG_NO_ZEROES.
   DEBUG: [<IP>:32836]: opt=7, length=34, data=b'\x00\x00\x00\x1c<UID>\x00\x00'
   DEBUG: [<IP>:32836]: opt=1, length=28, data=b'<UID>'
    INFO: [<IP>:32836] Negotiated export: <UID>.

crabique · 2023-05-31T14:31:03Z

I saw some commits with relevant changes pushed to master, can this be considered resolved? And is there a release/tag I can try?

crabique · 2024-05-15T11:13:21Z

Turns out, the problem was my TCP proxy (ingress-nginx) killing the tcp-proxied PSQL connections after 600s (default stream-timeout value), it works perfectly when the ingress is taken out of the equation or when the timeout is increased.

However, when such timeouts happen, it seems the pool_pre_ping only fixes half the problem, the other half is that the operation fails (btw not only the NBD serve, but also long backup or restore, cleanup etc).

crabique added the bug Something isn't working label Apr 10, 2023

elemental-lf self-assigned this Apr 25, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

NBD server losing connection to PostgreSQL #154

NBD server losing connection to PostgreSQL #154

crabique commented Apr 10, 2023 •

edited

Loading

elemental-lf commented Apr 25, 2023

crabique commented Apr 25, 2023 •

edited

Loading

crabique commented May 31, 2023

crabique commented May 15, 2024 •

edited

Loading

NBD server losing connection to PostgreSQL #154

NBD server losing connection to PostgreSQL #154

Comments

crabique commented Apr 10, 2023 • edited Loading

elemental-lf commented Apr 25, 2023

crabique commented Apr 25, 2023 • edited Loading

crabique commented May 31, 2023

crabique commented May 15, 2024 • edited Loading

crabique commented Apr 10, 2023 •

edited

Loading

crabique commented Apr 25, 2023 •

edited

Loading

crabique commented May 15, 2024 •

edited

Loading