
Conversation

@seongsukwon-moreh
Contributor

This PR addresses database connection stability issues in db_utils.py. The server was frequently experiencing intermittent connection failures, most notably:

  • sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) server closed the connection unexpectedly...

This error also sometimes manifested on the web dashboard:

[screenshot: the same "server closed the connection unexpectedly" error surfaced on the web dashboard]

The changes are as follows:

  1. Replace StaticPool with QueuePool for _max_connections == 1:

    • Problem: An initial attempt to fix stale connections by adding pool_pre_ping=True to the existing StaticPool backfired, immediately causing psycopg2.ProgrammingError: set_session cannot be used inside a transaction.
    • Fix: Replaced StaticPool with QueuePool(pool_size=1). This maintains the intent of using a single connection "at rest" while gaining the critical safety of QueuePool's connection reset (e.g., rollback()) logic on check-in.
  2. Introduce max_overflow=5 for the _max_connections == 1 case:

    • Problem: Simply using QueuePool(pool_size=1, max_overflow=0) revealed a new issue: sqlalchemy.exc.TimeoutError: QueuePool limit of size 1 overflow 0 reached.... This shows that the application does need more than one concurrent connection during brief periods of load.
    • Fix: Set max_overflow=5. This respects the spirit of _max_connections=1 (keeping a small resting pool) while allowing the application to handle real-world concurrent bursts (up to 1+5=6 connections) without timing out.
  3. Add Stability Options (pre_ping, recycle) to all QueuePools:

    • Problem: Long-running connections can become "stale" (e.g., due to network firewall timeouts or database restarts), leading to errors on reuse.
    • Fix: Added pool_pre_ping=True (to validate connections before use) and pool_recycle=1800 (to refresh connections every 30 minutes) to all QueuePool configurations.
  4. Standardize Parameter Name (size -> pool_size):

    • Problem: The else block used the parameter size. This causes a TypeError if mixed with other pool_ prefixed arguments (like pool_recycle).
    • Fix: Standardized on pool_size=_max_connections for consistency and to prevent parameter-mixing errors.
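Put together, the engine setup described in points 1–4 looks roughly like this. This is a sketch only: the function name and the `_max_connections` variable follow the PR description, not the actual db_utils.py source.

```python
# Sketch of the pooling configuration described above (points 1-4).
# `make_engine` and `_max_connections` are names assumed from the PR text.
import sqlalchemy
from sqlalchemy.pool import QueuePool

def make_engine(conn_string: str, _max_connections: int):
    """Create an engine with the stability options from this PR."""
    return sqlalchemy.create_engine(
        conn_string,
        poolclass=QueuePool,          # QueuePool resets connections on check-in
        pool_size=_max_connections,   # small resting pool
        # Allow bursts (up to 1+5=6 connections) in the single-connection case.
        max_overflow=5 if _max_connections == 1 else 0,
        pool_pre_ping=True,           # validate each connection before reuse
        pool_recycle=1800,            # refresh connections every 30 minutes
    )
```

With this setup a stale connection is detected by the pre-ping and transparently replaced, instead of surfacing as an OperationalError at query time.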

Tested (run the relevant ones):

  • Code formatting: install pre-commit (auto-check on commit) or bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: /smoke-test (CI) or pytest tests/test_smoke.py (local)
  • Relevant individual tests: /smoke-test -k test_name (CI) or pytest tests/test_smoke.py::test_name (local)
  • Backward compatibility: /quicktest-core (CI) or pytest tests/smoke_tests/test_backward_compat.py (local)

@seongsukwon-moreh seongsukwon-moreh marked this pull request as ready for review November 3, 2025 09:20
@SeungjinYang SeungjinYang self-requested a review November 3, 2025 18:57
Collaborator

@SeungjinYang left a comment

Thanks for the change @seongsukwon-moreh! I left a couple of minor comments.

One interesting additional improvement to consider (in this PR or otherwise) is how the engine is created in the async case, which still uses the rather inefficient NullPool:

        if async_engine:
            conn_string = conn_string.replace('postgresql://',
                                              'postgresql+asyncpg://')
            # This is an AsyncEngine, instead of a (normal, synchronous) Engine,
            # so we should not put it in the cache. Instead, just return.
            return sqlalchemy_async.create_async_engine(
                conn_string, poolclass=sqlalchemy.NullPool)

We won't be able to use QueuePool for it because QueuePool does not support asyncio, but SQLAlchemy provides AsyncAdaptedQueuePool for exactly this purpose.
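A minimal sketch of that suggestion, mirroring the DSN rewrite from the snippet above (asyncpg would need to be installed for the engine to actually connect; the function name is an assumption):

```python
# Sketch: swap NullPool for AsyncAdaptedQueuePool in the async branch.
from sqlalchemy.ext.asyncio import create_async_engine
from sqlalchemy.pool import AsyncAdaptedQueuePool

def make_async_engine(conn_string: str):
    """Build the async engine with a real pool instead of NullPool."""
    conn_string = conn_string.replace('postgresql://',
                                      'postgresql+asyncpg://')
    # Same stability options as the sync engines in this PR.
    return create_async_engine(
        conn_string,
        poolclass=AsyncAdaptedQueuePool,
        pool_size=1,
        max_overflow=5,
        pool_pre_ping=True,
        pool_recycle=1800,
    )
```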

Comment on lines 449 to 464

             sqlalchemy.create_engine(
-                conn_string, poolclass=sqlalchemy.pool.StaticPool))
+                conn_string,
+                poolclass=sqlalchemy.pool.QueuePool,
+                pool_size=1,
+                max_overflow=5,
+                pool_pre_ping=True,
+                pool_recycle=1800))
         else:
             _postgres_engine_cache[conn_string] = (
                 sqlalchemy.create_engine(
                     conn_string,
                     poolclass=sqlalchemy.pool.QueuePool,
-                    size=_max_connections,
-                    max_overflow=0))
+                    pool_size=_max_connections,
+                    max_overflow=0,
+                    pool_pre_ping=True,
+                    pool_recycle=1800))
Collaborator

Now that we're using QueuePool for both cases, we should be able to collapse the if statement down to a couple of parameters (pool_size and max_overflow) instead of duplicating the entire statement.
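The suggested collapse might look like this (a sketch: the cache dict, the function name, and `_max_connections` are assumptions based on the diff above):

```python
# Sketch: a single create_engine call, with only pool_size and
# max_overflow varying between the two cases.
import sqlalchemy

_postgres_engine_cache = {}

def get_engine(conn_string: str, _max_connections: int):
    """Return a cached engine, creating it on first use."""
    if conn_string not in _postgres_engine_cache:
        if _max_connections == 1:
            pool_size, max_overflow = 1, 5
        else:
            pool_size, max_overflow = _max_connections, 0
        _postgres_engine_cache[conn_string] = sqlalchemy.create_engine(
            conn_string,
            poolclass=sqlalchemy.pool.QueuePool,
            pool_size=pool_size,
            max_overflow=max_overflow,
            pool_pre_ping=True,
            pool_recycle=1800)
    return _postgres_engine_cache[conn_string]
```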

Collaborator

One thing that could be interesting re: comment above: For the case where _max_connections > 1, we can perhaps dynamically adjust max_overflow to be max(0, 5 - _max_connections) so we always guarantee the pool can scale up to a certain number of connections (in this case 5) without providing unnecessary overflow in case where one isn't needed.
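In code, that suggestion amounts to something like the following (the target of 5 total connections is taken from the comment above; the function name is hypothetical):

```python
# Sketch of dynamically sizing overflow so that
# pool_size + max_overflow always reaches a target total (5 here).
TARGET_TOTAL_CONNECTIONS = 5

def compute_max_overflow(max_connections: int) -> int:
    """Extra overflow needed to guarantee TARGET_TOTAL_CONNECTIONS."""
    return max(0, TARGET_TOTAL_CONNECTIONS - max_connections)
```

So compute_max_overflow(1) yields 4, letting the pool still reach five connections, while pools already sized at five or more get no unnecessary overflow.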

Contributor Author

I've updated the code based on your suggestions. Please review it.

@SeungjinYang
Collaborator

SeungjinYang commented Nov 3, 2025

/smoke-test --aws -k basic --postgres (passed)

@SeungjinYang
Collaborator

/smoke-test --aws -k basic --postgres

@SeungjinYang
Collaborator

Thanks! Merging now.

@SeungjinYang SeungjinYang merged commit d99426d into skypilot-org:master Nov 4, 2025
21 checks passed
@seongsukwon-moreh seongsukwon-moreh deleted the fix_db_connection_error branch November 4, 2025 06:38