Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test_lr_with_slow_safekeeper is flaky because logical_replication_sync not waiting for tablesync #10242

Open
alexanderlaw opened this issue Dec 25, 2024 · 1 comment
Labels
t/bug Issue Type: Bug

Comments

@alexanderlaw
Copy link

Multiple failures of test_lr_with_slow_safekeeper, e.g.:
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-10238/12486246775/index.html#/testresult/455c25d1ecd98aac
with the following diagnostics:

test_runner/regress/test_logical_replication.py:317: in test_lr_with_slow_safekeeper
    assert [r[0] for r in vanilla_pg.safe_psql("select * from t")] == [1]
E   assert [] == [1]
E     Right contains one more item: 1
E     Full diff:
E     - [1]
E     ?  -
E     + []

pgdata-vanilla/pg.log:

2024-12-24 21:29:09.658 UTC [92514] LOG:  logical replication apply worker for subscription "sub1" has started
2024-12-24 21:29:09.666 UTC [92226] LOG:  received fast shutdown request
2024-12-24 21:29:09.666 UTC [92576] LOG:  logical replication table synchronization worker for subscription "sub1", table "t" has started
2024-12-24 21:29:09.666 UTC [92226] LOG:  aborting any active transactions
2024-12-24 21:29:09.666 UTC [92576] FATAL:  terminating logical replication worker due to administrator command

Or https://neon-github-public-dev.s3.amazonaws.com/reports/pr-10238/12486246775/index.html#/testresult/4efed0c5fa7cc95b

test_runner/regress/test_logical_replication.py:317: in test_lr_with_slow_safekeeper
    assert [r[0] for r in vanilla_pg.safe_psql("select * from t")] == [1]
E   assert [] == [1]
E     Right contains one more item: 1
E     Full diff:
E     - [1]
E     ?  -
E     + []

pgdata-vanilla/pg.log:

2024-12-24 21:29:04.111 UTC [88099] LOG:  logical replication apply worker for subscription "sub1" has started
2024-12-24 21:29:04.123 UTC [88067] LOG:  received fast shutdown request
2024-12-24 21:29:04.126 UTC [88067] LOG:  aborting any active transactions
2024-12-24 21:29:04.126 UTC [88099] FATAL:  terminating logical replication worker due to administrator command

show that the logical_replication_sync call in the test:

    vanilla_pg.safe_psql("create table t(a int)")
    connstr = endpoint.connstr().replace("'", "''")
    vanilla_pg.safe_psql(f"create subscription sub1 connection '{connstr}' publication pub")
    logical_replication_sync(vanilla_pg, endpoint)

    vanilla_pg.stop()

doesn't wait for the table synchronization to finish. On a successful run, pg.log contains:

2024-12-25 14:33:55.575 EET [2152526] LOG:  database system is ready to accept connections
2024-12-25 14:33:55.868 EET [2152610] LOG:  logical replication apply worker for subscription "sub1" has started
2024-12-25 14:33:55.918 EET [2152625] LOG:  logical replication table synchronization worker for subscription "sub1", table "t" has started
2024-12-25 14:33:55.952 EET [2152625] LOG:  logical replication table synchronization worker for subscription "sub1", table "t" has finished
2024-12-25 14:33:56.420 EET [2152526] LOG:  received fast shutdown request

This test failure can be easily reproduced with a sleep added inside TablesyncWorkerMain():

@@ -1714,6 +1714,7 @@ TablesyncWorkerMain(Datum main_arg)
 
     SetupApplyOrSyncWorker(worker_slot);
 
+pg_usleep(1000000);
     run_tablesync_worker();
@alexanderlaw alexanderlaw added the t/bug Issue Type: Bug label Dec 25, 2024
@github-actions github-actions bot added the external A PR or Issue is created by an external user label Dec 25, 2024
@alexanderlaw
Copy link
Author

I've discovered that the above recipe for reproducing doesn't work on each test run because of another issue. With the additional logging added to logical_replication_sync:

@@ -4927,6 +4928,10 @@ def logical_replication_sync(subscriber: PgProtocol, publisher: PgProtocol) -> L
     """Wait logical replication subscriber to sync with publisher."""
     publisher_lsn = Lsn(publisher.safe_psql("SELECT pg_current_wal_flush_lsn()")[0][0])
     while True:
+        res = subscriber.safe_psql("select * from pg_catalog.pg_stat_subscription")
+        log.info(f"pg_stat_subscription: {res}")
+        res = subscriber.safe_psql("select latest_end_lsn from pg_catalog.pg_stat_subscription")
+        log.info(f"latest_end_lsn(s): {res}")
         res = subscriber.safe_psql("select latest_end_lsn from pg_catalog.pg_stat_subscription")[0][

I'm seeing the following messages (with line breaks added) in the test log:

2024-12-28 07:57:19.668 INFO [neon_fixtures.py:300] Executing query: create table t(a int)
2024-12-28 07:57:19.743 INFO [neon_fixtures.py:300] Executing query: create subscription sub1 connection 'host=localhost port=16016
 user=cloud_admin dbname=postgres options=''-cstatement_timeout=120s ''' publication pub
2024-12-28 07:57:19.779 INFO [neon_fixtures.py:300] Executing query: SELECT pg_current_wal_flush_lsn()
2024-12-28 07:57:19.784 INFO [neon_fixtures.py:300] Executing query: select * from pg_catalog.pg_stat_subscription
2024-12-28 07:57:19.785 INFO [neon_fixtures.py:4932] pg_stat_subscription: 
[(16387, 'sub1', 'apply', 2698456, None, None, '0/292FCC0', datetime.datetime(2024, 12, 28, 7, 57, 19, 784847, 
 tzinfo=datetime.timezone(datetime.timedelta(seconds=7200))), datetime.datetime(2024, 12, 28, 7, 57, 19, 784988,
 tzinfo=datetime.timezone(datetime.timedelta(seconds=7200))), '0/292FCC0', datetime.datetime(2024, 12, 28, 7, 57, 19, 784847,
 tzinfo=datetime.timezone(datetime.timedelta(seconds=7200))))]
2024-12-28 07:57:19.788 INFO [neon_fixtures.py:300] Executing query: select latest_end_lsn from pg_catalog.pg_stat_subscription
2024-12-28 07:57:19.789 INFO [neon_fixtures.py:4934] latest_end_lsn(s): [(None,), ('0/292FCC0',)]
2024-12-28 07:57:19.792 INFO [neon_fixtures.py:300] Executing query: select latest_end_lsn from pg_catalog.pg_stat_subscription
2024-12-28 07:57:20.296 INFO [neon_fixtures.py:300] Executing query: select * from pg_catalog.pg_stat_subscription
2024-12-28 07:57:20.298 INFO [neon_fixtures.py:4932] pg_stat_subscription: 
[
(16387, 'sub1', 'table synchronization', 2698460, None, 16384, None, datetime.datetime(2024, 12, 28, 7, 57, 19, 785883,
 tzinfo=datetime.timezone(datetime.timedelta(seconds=7200))), datetime.datetime(2024, 12, 28, 7, 57, 19, 785883,
 tzinfo=datetime.timezone(datetime.timedelta(seconds=7200))), None, datetime.datetime(2024, 12, 28, 7, 57, 19, 785883,
 tzinfo=datetime.timezone(datetime.timedelta(seconds=7200)))),
(16387, 'sub1', 'apply', 2698456, None, None, '0/292FCC0', datetime.datetime(2024, 12, 28, 7, 57, 19, 784847,
 tzinfo=datetime.timezone(datetime.timedelta(seconds=7200))), datetime.datetime(2024, 12, 28, 7, 57, 19, 784988,
 tzinfo=datetime.timezone(datetime.timedelta(seconds=7200))), '0/292FCC0', datetime.datetime(2024, 12, 28, 7, 57, 19, 784847,
 tzinfo=datetime.timezone(datetime.timedelta(seconds=7200))))]
2024-12-28 07:57:20.300 INFO [neon_fixtures.py:300] Executing query: select latest_end_lsn from pg_catalog.pg_stat_subscription
2024-12-28 07:57:20.302 INFO [neon_fixtures.py:4934] latest_end_lsn(s): [(None,), ('0/292FCC0',)]
2024-12-28 07:57:20.305 INFO [neon_fixtures.py:300] Executing query: select latest_end_lsn from pg_catalog.pg_stat_subscription
2024-12-28 07:57:20.809 INFO [neon_fixtures.py:300] Executing query: select * from pg_catalog.pg_stat_subscription
2024-12-28 07:57:20.810 INFO [neon_fixtures.py:4932] pg_stat_subscription:
[
(16387, 'sub1', 'table synchronization', 2698460, None, 16384, '0/29301B0', datetime.datetime(2024, 12, 28, 7, 57, 20, 810037,
 tzinfo=datetime.timezone(datetime.timedelta(seconds=7200))), datetime.datetime(2024, 12, 28, 7, 57, 20, 810057,
 tzinfo=datetime.timezone(datetime.timedelta(seconds=7200))), '0/29301B0', datetime.datetime(2024, 12, 28, 7, 57, 20, 810037,
 tzinfo=datetime.timezone(datetime.timedelta(seconds=7200)))),
(16387, 'sub1', 'apply', 2698456, None, None, '0/29301B0', datetime.datetime(2024, 12, 28, 7, 57, 20, 793298,
 tzinfo=datetime.timezone(datetime.timedelta(seconds=7200))), datetime.datetime(2024, 12, 28, 7, 57, 20, 793393,
 tzinfo=datetime.timezone(datetime.timedelta(seconds=7200))), '0/29301B0', datetime.datetime(2024, 12, 28, 7, 57, 20, 793298,
 tzinfo=datetime.timezone(datetime.timedelta(seconds=7200))))]
2024-12-28 07:57:20.813 INFO [neon_fixtures.py:300] Executing query: select latest_end_lsn from pg_catalog.pg_stat_subscription
2024-12-28 07:57:20.814 INFO [neon_fixtures.py:4934] latest_end_lsn(s): [('0/2930228',)]
2024-12-28 07:57:20.817 INFO [neon_fixtures.py:300] Executing query: select latest_end_lsn from pg_catalog.pg_stat_subscription
2024-12-28 07:57:20.818 INFO [neon_fixtures.py:4939] subscriber_lsn=0/2930228
2024-12-28 07:57:20.818 INFO [neon_fixtures.py:4941] Subscriber LSN=0/2930228, publisher LSN=0/292FCC0
2024-12-28 07:57:20.818 INFO [neon_fixtures.py:2901] Running command '.../neon/pg_install/v17/bin/pg_ctl -w
 -D .../neon/test_output/test_lr_with_slow_safekeeper[release-pg17]/pgdata-vanilla stop'

That is, "select * from pg_catalog.pg_stat_subscription" returns two records:

(16387, 'sub1', 'table synchronization', 2698460, None, 16384, None, ...),
(16387, 'sub1', 'apply', 2698456, None, None, '0/292FCC0', ...)

and the first record contains latest_end_lsn == None, so code below the check "if res:" is not reached in this case until the "table synchronization" disappears. But as the query lacks "order by", the "apply" record can be the first as well and then the query might fail with the sleep added (or without, due to concurrent load).

"select latest_end_lsn from pg_catalog.pg_stat_subscription where latest_end_lsn is not null" makes the test work (fail) deterministically.

@jcsp jcsp removed the external A PR or Issue is created by an external user label Jan 6, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
t/bug Issue Type: Bug
Projects
None yet
Development

No branches or pull requests

2 participants