
Conversation

@seongsukwon-moreh
Contributor

This PR addresses database connection stability issues in db_utils.py. The server was frequently experiencing intermittent connection failures, most notably:

  • sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) server closed the connection unexpectedly...

This error also sometimes manifested on the web dashboard:

[screenshot: the same "server closed the connection unexpectedly" error surfaced on the web dashboard]

The changes are as follows:

  1. Replace StaticPool with QueuePool for _max_connections == 1:

    • Problem: An initial attempt to fix stale connections by adding pool_pre_ping=True to the existing StaticPool backfired, immediately causing psycopg2.ProgrammingError: set_session cannot be used inside a transaction.
    • Fix: Replaced StaticPool with QueuePool(pool_size=1). This maintains the intent of using a single connection "at rest" while gaining the critical safety of QueuePool's connection reset (e.g., rollback()) logic on check-in.
  2. Introduce max_overflow=5 for the _max_connections == 1 case:

    • Problem: Simply using QueuePool(pool_size=1, max_overflow=0) revealed a new issue: sqlalchemy.exc.TimeoutError: QueuePool limit of size 1 overflow 0 reached.... This shows that the application does need more than one concurrent connection during brief periods of load.
    • Fix: Set max_overflow=5. This respects the spirit of _max_connections=1 (keeping a small resting pool) while allowing the application to handle real-world concurrent bursts (up to 1+5=6 connections) without timing out.
  3. Add Stability Options (pre_ping, recycle) to all QueuePools:

    • Problem: Long-running connections can become "stale" (e.g., due to network firewall timeouts or database restarts), leading to errors on reuse.
    • Fix: Added pool_pre_ping=True (to validate connections before use) and pool_recycle=1800 (to refresh connections every 30 minutes) to all QueuePool configurations.
  4. Standardize Parameter Name (size -> pool_size):

    • Problem: The else block used the parameter size. This causes a TypeError if mixed with other pool_ prefixed arguments (like pool_recycle).
    • Fix: Standardized on pool_size=_max_connections for consistency and to prevent parameter-mixing errors.
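Put together, the engine setup described in points 1–4 looks roughly like this. This is a sketch only: the function name and the `_max_connections` variable follow the PR description, not the actual db_utils.py source.

```python
# Sketch of the pooling configuration described above (points 1-4).
# `make_engine` and `_max_connections` are names assumed from the PR text.
import sqlalchemy
from sqlalchemy.pool import QueuePool

def make_engine(conn_string: str, _max_connections: int):
    """Create an engine with the stability options from this PR."""
    return sqlalchemy.create_engine(
        conn_string,
        poolclass=QueuePool,          # QueuePool resets connections on check-in
        pool_size=_max_connections,   # small resting pool
        # Allow bursts (up to 1+5=6 connections) in the single-connection case.
        max_overflow=5 if _max_connections == 1 else 0,
        pool_pre_ping=True,           # validate each connection before reuse
        pool_recycle=1800,            # refresh connections every 30 minutes
    )
```

With this setup a stale connection is detected by the pre-ping and transparently replaced, instead of surfacing as an OperationalError at query time.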

Tested (run the relevant ones):

  • Code formatting: install pre-commit (auto-check on commit) or bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: /smoke-test (CI) or pytest tests/test_smoke.py (local)
  • Relevant individual tests: /smoke-test -k test_name (CI) or pytest tests/test_smoke.py::test_name (local)
  • Backward compatibility: /quicktest-core (CI) or pytest tests/smoke_tests/test_backward_compat.py (local)

@seongsukwon-moreh seongsukwon-moreh marked this pull request as ready for review November 3, 2025 09:20
@SeungjinYang SeungjinYang self-requested a review November 3, 2025 18:57
Collaborator

@SeungjinYang left a comment

Thanks for the change @seongsukwon-moreh! I left a couple of minor comments.

One interesting additional improvement to consider (in this PR or otherwise) is how the engine is created in the async case, which still uses the rather inefficient NullPool:

        if async_engine:
            conn_string = conn_string.replace('postgresql://',
                                              'postgresql+asyncpg://')
            # This is an AsyncEngine, instead of a (normal, synchronous) Engine,
            # so we should not put it in the cache. Instead, just return.
            return sqlalchemy_async.create_async_engine(
                conn_string, poolclass=sqlalchemy.NullPool)

We won't be able to use QueuePool for it because QueuePool does not support asyncio, but SQLAlchemy provides AsyncAdaptedQueuePool for exactly this purpose.
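A minimal sketch of that suggestion, mirroring the DSN rewrite from the snippet above (asyncpg would need to be installed for the engine to actually connect; the function name is an assumption):

```python
# Sketch: swap NullPool for AsyncAdaptedQueuePool in the async branch.
from sqlalchemy.ext.asyncio import create_async_engine
from sqlalchemy.pool import AsyncAdaptedQueuePool

def make_async_engine(conn_string: str):
    """Build the async engine with a real pool instead of NullPool."""
    conn_string = conn_string.replace('postgresql://',
                                      'postgresql+asyncpg://')
    # Same stability options as the sync engines in this PR.
    return create_async_engine(
        conn_string,
        poolclass=AsyncAdaptedQueuePool,
        pool_size=1,
        max_overflow=5,
        pool_pre_ping=True,
        pool_recycle=1800,
    )
```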

Comment on lines 449 to 464

             sqlalchemy.create_engine(
-                conn_string, poolclass=sqlalchemy.pool.StaticPool))
+                conn_string,
+                poolclass=sqlalchemy.pool.QueuePool,
+                pool_size=1,
+                max_overflow=5,
+                pool_pre_ping=True,
+                pool_recycle=1800))
         else:
             _postgres_engine_cache[conn_string] = (
                 sqlalchemy.create_engine(
                     conn_string,
                     poolclass=sqlalchemy.pool.QueuePool,
-                    size=_max_connections,
-                    max_overflow=0))
+                    pool_size=_max_connections,
+                    max_overflow=0,
+                    pool_pre_ping=True,
+                    pool_recycle=1800))
Collaborator

Now that we're using QueuePool for both cases, we should be able to collapse the if statement down to a couple of parameters (pool_size and max_overflow) instead of duplicating the entire statement.
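The suggested collapse might look like this (a sketch: the cache dict, the function name, and `_max_connections` are assumptions based on the diff above):

```python
# Sketch: a single create_engine call, with only pool_size and
# max_overflow varying between the two cases.
import sqlalchemy

_postgres_engine_cache = {}

def get_engine(conn_string: str, _max_connections: int):
    """Return a cached engine, creating it on first use."""
    if conn_string not in _postgres_engine_cache:
        if _max_connections == 1:
            pool_size, max_overflow = 1, 5
        else:
            pool_size, max_overflow = _max_connections, 0
        _postgres_engine_cache[conn_string] = sqlalchemy.create_engine(
            conn_string,
            poolclass=sqlalchemy.pool.QueuePool,
            pool_size=pool_size,
            max_overflow=max_overflow,
            pool_pre_ping=True,
            pool_recycle=1800)
    return _postgres_engine_cache[conn_string]
```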

Collaborator

One thing that could be interesting re: comment above: For the case where _max_connections > 1, we can perhaps dynamically adjust max_overflow to be max(0, 5 - _max_connections) so we always guarantee the pool can scale up to a certain number of connections (in this case 5) without providing unnecessary overflow in case where one isn't needed.
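In code, that suggestion amounts to something like the following (the target of 5 total connections is taken from the comment above; the function name is hypothetical):

```python
# Sketch of dynamically sizing overflow so that
# pool_size + max_overflow always reaches a target total (5 here).
TARGET_TOTAL_CONNECTIONS = 5

def compute_max_overflow(max_connections: int) -> int:
    """Extra overflow needed to guarantee TARGET_TOTAL_CONNECTIONS."""
    return max(0, TARGET_TOTAL_CONNECTIONS - max_connections)
```

So compute_max_overflow(1) yields 4, letting the pool still reach five connections, while pools already sized at five or more get no unnecessary overflow.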

Contributor Author

I've updated the code based on your suggestions. Please review it.

@SeungjinYang
Collaborator

SeungjinYang commented Nov 3, 2025

/smoke-test --aws -k basic --postgres (passed)

@SeungjinYang
Collaborator

/smoke-test --aws -k basic --postgres

@SeungjinYang
Collaborator

Thanks! Merging now.

@SeungjinYang SeungjinYang merged commit d99426d into skypilot-org:master Nov 4, 2025
21 checks passed
@seongsukwon-moreh seongsukwon-moreh deleted the fix_db_connection_error branch November 4, 2025 06:38