test_lr_with_slow_safekeeper is flaky because logical_replication_sync not waiting for tablesync #10242
Comments
I've discovered that the above recipe for reproducing doesn't work on each test run because of another issue. With the additional logging added to logical_replication_sync:
@@ -4927,6 +4928,10 @@ def logical_replication_sync(subscriber: PgProtocol, publisher: PgProtocol) -> L
     """Wait logical replication subscriber to sync with publisher."""
     publisher_lsn = Lsn(publisher.safe_psql("SELECT pg_current_wal_flush_lsn()")[0][0])
     while True:
+        res = subscriber.safe_psql("select * from pg_catalog.pg_stat_subscription")
+        log.info(f"pg_stat_subscription: {res}")
+        res = subscriber.safe_psql("select latest_end_lsn from pg_catalog.pg_stat_subscription")
+        log.info(f"latest_end_lsn(s): {res}")
         res = subscriber.safe_psql("select latest_end_lsn from pg_catalog.pg_stat_subscription")[0][
I'm seeing the following messages (with line breaks added) in the test log:
That is, "select * from pg_catalog.pg_stat_subscription" returns two records:
and the first record contains latest_end_lsn == None, so the code below the "if res:" check is not reached until the "table synchronization" record disappears. But because the query lacks "order by", the "apply" record can come first instead, and then the test might fail with the sleep added (or without it, under concurrent load). Using "select latest_end_lsn from pg_catalog.pg_stat_subscription where latest_end_lsn is not null" makes the reproduction work (that is, the test fails) deterministically.
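The order dependence described above can be illustrated without a database. This is only a minimal sketch under stated assumptions: the helper names first_row_lsn and first_non_null_lsn are hypothetical, the rows are hand-written stand-ins for what pg_stat_subscription might return, and the LSN value is made up. It is not the code of logical_replication_sync itself.

```python
from typing import List, Optional, Tuple

# One-column rows, standing in for "select latest_end_lsn from pg_stat_subscription".
Row = Tuple[Optional[str]]


def first_row_lsn(rows: List[Row]) -> Optional[str]:
    """Mimic the `[0][0]` lookup: only the first returned row is inspected."""
    return rows[0][0] if rows else None


def first_non_null_lsn(rows: List[Row]) -> Optional[str]:
    """Mimic adding `where latest_end_lsn is not null` to the query."""
    non_null = [row for row in rows if row[0] is not None]
    return non_null[0][0] if non_null else None


# Hypothetical rows: the tablesync worker reports NULL, the apply worker an LSN.
tablesync_row: Row = (None,)
apply_row: Row = ("0/1500000",)  # made-up LSN, for illustration only

tablesync_first = [tablesync_row, apply_row]
apply_first = [apply_row, tablesync_row]

# Without "order by", either ordering is possible, and the results differ.
print(first_row_lsn(tablesync_first))
print(first_row_lsn(apply_first))

# With the NULL rows filtered out, the result no longer depends on row order.
print(first_non_null_lsn(tablesync_first))
print(first_non_null_lsn(apply_first))
```

With the tablesync row first, the unfiltered lookup yields None and the caller keeps waiting. With the apply row first, or with the filter applied, it yields an LSN right away, so the wait can end while table synchronization is still running.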
Multiple failures of test_lr_with_slow_safekeeper, e.g.:
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-10238/12486246775/index.html#/testresult/455c25d1ecd98aac
with the following diagnostics:
pgdata-vanilla/pg.log:
Or https://neon-github-public-dev.s3.amazonaws.com/reports/pr-10238/12486246775/index.html#/testresult/4efed0c5fa7cc95b
pgdata-vanilla/pg.log:
show that the logical_replication_sync call in the test:
doesn't wait for the table synchronization to finish. On a successful run, pg.log contains:
This test failure can be easily reproduced with a sleep added inside TablesyncWorkerMain():