Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

aiopg times out while attempting to connect if falling back to an IPv4 address #907

Open
1 task done
thanatos opened this issue Apr 22, 2024 · 2 comments
Open
1 task done
Labels

Comments

@thanatos
Copy link

thanatos commented Apr 22, 2024

Describe the bug

While attempting to connect to CockroachDB, aiopg raises TimeoutError.

To Reproduce

  1. Run CockroachDB on localhost. I am running v22.2.19 on Linux.
  2. Execute the minimal example below.
import argparse
import asyncio
import logging
from pathlib import Path

import aiopg


async def test_db_pool(dbname: str, username: str, certs: Path):
    logging.info('Creating test connection.')
    kwargs = {
        'host': 'localhost',
        'port': 26257,
        'dbname': dbname,
        'user': username,
        'sslmode': 'verify-full',
        'sslcert': certs / f'client.{username}.crt',
        'sslkey': certs / f'client.{username}.key',
        'sslrootcert': certs / 'ca.crt',
    }
    async with aiopg.connect(**kwargs) as pool:
        logging.info('Connection ready.')


def main():
    logging.basicConfig(level=logging.DEBUG)
    parser = argparse.ArgumentParser()
    parser.add_argument('--certs-dir', action='store')
    args = parser.parse_args()

    asyncio.run(test_db_pool('postgres', 'root', Path(args.certs_dir)))


if __name__ == '__main__':
    main()

Expected behavior

The example exits successfully.

Logs/tracebacks

DEBUG:asyncio:Using selector: EpollSelector
INFO:root:Creating test connection.
Traceback (most recent call last):
  File "/usr/lib/python3.11/asyncio/tasks.py", line 490, in wait_for
    return fut.result()
           ^^^^^^^^^^^^
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/code/example/venv/lib/python3.11/site-packages/aiopg/connection.py", line 881, in _poll
    await asyncio.wait_for(self._waiter, timeout)
  File "/usr/lib/python3.11/asyncio/tasks.py", line 492, in wait_for
    raise exceptions.TimeoutError() from exc
TimeoutError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/code/example/./aio_fail.py", line 35, in <module>
    main()
  File "/code/example/./aio_fail.py", line 31, in main
    asyncio.run(test_db_pool('postgres', 'root', Path(args.certs_dir)))
  File "/usr/lib/python3.11/asyncio/runners.py", line 190, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/asyncio/base_events.py", line 653, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/code/example/./aio_fail.py", line 21, in test_db_pool
    async with aiopg.connect(**kwargs) as pool:
  File "/code/example/venv/lib/python3.11/site-packages/aiopg/utils.py", line 82, in __aenter__
    self._obj = await self._coro
                ^^^^^^^^^^^^^^^^
  File "/code/example/venv/lib/python3.11/site-packages/aiopg/connection.py", line 1225, in _connect
    await self._poll(self._waiter, self._timeout)  # type: ignore
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/code/example/venv/lib/python3.11/site-packages/aiopg/connection.py", line 883, in _poll
    await asyncio.shield(self.close())
                         ^^^^^^^^^^^^
  File "/code/example/venv/lib/python3.11/site-packages/aiopg/connection.py", line 995, in close
    self._close()
  File "/code/example/venv/lib/python3.11/site-packages/aiopg/connection.py", line 977, in _close
    self._loop.remove_reader(self._fileno)
  File "/usr/lib/python3.11/asyncio/selector_events.py", line 345, in remove_reader
    return self._remove_reader(fd)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/asyncio/selector_events.py", line 290, in _remove_reader
    self._selector.modify(fd, mask, (None, writer))
  File "/usr/lib/python3.11/selectors.py", line 389, in modify
    self._selector.modify(key.fd, selector_events)
FileNotFoundError: [Errno 2] No such file or directory

Python Version

$ python --version
Python 3.11.5

aiopg Version

$ python -m pip show aiopg
Name: aiopg
Version: 1.4.0
Summary: Postgres integration with asyncio.
Home-page: https://aiopg.readthedocs.io
Author: Andrew Svetlov
Author-email: [email protected]
License: BSD
Location: /home/royiv/code/groceries/venv/lib/python3.11/site-packages
Requires: async-timeout, psycopg2-binary
Required-by: 

OS

Arch Linux

Additional context

Cockroach is a distributed database that uses the PostgreSQL wire protocol for queries. psql is capable of connecting to it, and executing queries. (So is psycopg2.)

When running the example, there is a long pause after Creating test connection., approximately 60s.

aiopg worked a few CockroachDB versions prior to this one, but I am sorry, I did not record the working version's number.

A packet captures shows aiopg makes the connection (the 3-way handshake happens) but no data is ever transmitted. This is a bit surprising, since at this point I am not sure what Cockroach could have done to have acted differently from vanilla PG.

Code of Conduct

  • I agree to follow the aio-libs Code of Conduct
@thanatos thanatos added the bug label Apr 22, 2024
@thanatos
Copy link
Author

since at this point I am not sure what Cockroach could have done to have acted differently from vanilla PG.

I can replicate this with vanilla Postgres & aiopg, so Cockroach can be removed from play here.

The difference is that PG listens, by default, to "localhost", and will bind to ::1 on that interface.

::1 is the first address aiopg will attempt, and the connection is made.

My CockroachDB instance, OTOH, is bound to 127.0.0.1 specifically. aiopg appears to try ::1 first (same as above with PG); this connection fails, as one would expect. aiopg correctly goes on to try 127.0.0.1, and that connection is made successfully, but aiopg hangs for some reason.

We can replicate this with vanilla PG by binding to 127.0.0.1 explicitly, like Cockroach was:

(in postgresql.conf)

listen_addresses = '127.0.0.1'

And then some slight tweaks to the example:

A slightly different example
import argparse
import asyncio
import logging
from pathlib import Path

import aiopg


async def test_db_pool(dbname: str, username: str, certs: Path):
    logging.info('Creating test connection.')
    kwargs = {
        'host': 'localhost',
        #'port': 26257,
        'port': 5432,
        'dbname': dbname,
        #'user': username,
        'user': 'my-username',
        #'sslmode': 'verify-full',
        #'sslcert': certs / f'client.{username}.crt',
        #'sslkey': certs / f'client.{username}.key',
        #'sslrootcert': certs / 'ca.crt',
    }
    async with aiopg.connect(**kwargs) as pool:
        logging.info('Connection ready.')


def main():
    logging.basicConfig(level=logging.DEBUG)
    parser = argparse.ArgumentParser()
    parser.add_argument('--certs-dir', action='store')
    args = parser.parse_args()

    asyncio.run(test_db_pool('postgres', 'root', Path(args.certs_dir)))


if __name__ == '__main__':
    main()

Now knowing this, we can get back to Cockroach: I had been passing --listen-addr localhost; this (apparently; this seems to be a bunch in Cockroach) causes it to bind (oddly) to 127.0.0.1. If I explicitly tell it to bind to --listen-addr '[::1]', then it binds to ::1, and aiopg doesn't have to try to fall back, so we avoid the bug.

But it's "falling back to attempting to connect over IPv4" that is the critical bit, not Cockroach.
And we hit the same issue, now with PG.

@thanatos thanatos changed the title aiopg times out while attempting to connect to CockroachDB aiopg times out while attempting to connect if falling back to an IPv4 address Apr 22, 2024
@thanatos
Copy link
Author

My working theory on this issue is that aiopg isn't managing the async events it is waiting on properly.

We need to call <pg conn>.poll(), during which psycopg2 will evaluate the state of the socket; it returns a value indicating what event we need to wait on.

aiopg/aiopg/connection.py

Lines 835 to 840 in 7b01833

elif state == psycopg2.extensions.POLL_WRITE:
if not self._writing:
self._loop.add_writer(
self._fileno, self._ready, weak_self # type: ignore
)
self._writing = True

I think the "problem" is that if not self._writing. Normally, we're attempting to make sure that we've not already got a loop.add_writer call pending on this socket. In that regard, the if is sensible. What I think is happening is that during poll(), PG is going to close & reopen the socket — to switch to an IPv4 connection — and when it does this, we lose our add_writer callback. (I assume b/c underneath, it's just configuring an epollfd, and that epollfd knows the socket is closed and drops the events.) However, the if prevents us from re-establishing the handler.

I added some debugging to see when _ready is invoked, and the actual state of the socket before/after poll, and what we decide to then do in this iteration of _ready:

ready(); from _poll
  sock (  me) = ('::1', 48080, 0, 0)
ready(): poll ok
  fileno, us = 6, inner = 6
  sock (  me) = ('::1', 48080, 0, 0)
ready(): POLL_WRITE
ready(): POLL_WRITE -- doing add_writer

ready(); add_reader
  sock (  me) = ('::1', 48080, 0, 0)
ready(): poll ok
  fileno, us = 6, inner = 6
  sock (peer) = ('127.0.0.1', 5432)
  sock (  me) = ('127.0.0.1', 48760)
ready(): POLL_WRITE

ready(); add_writer … ('::1', 48080, 0, 0)
  sock (  me) = ('127.0.0.1', 48760)
ready(): poll ok
  fileno, us = 6, inner = 6
  sock (peer) = ('127.0.0.1', 5432)
  sock (  me) = ('127.0.0.1', 48760)
ready(): POLL_WRITE

I'm not 100% convinced of this yet. Even if it's correct … boy do I not know how we're to fix it! It seems like the interface w/ PG doesn't give us the info we need to know that the FD has been swapped out. It's all good and fine if you're using select(2), like psycopg2's docs use in their examples, but if you're using epoll(2) … I worry you need to know that the socket has been recreated & your epoll triggers need to be recreated. And asyncio is going to be using epoll(2).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

1 participant