Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Background updates failed after upgrading from Synapse 1.25.0 #15677

Closed
jeeco5 opened this issue May 26, 2023 · 9 comments · Fixed by #15887
Closed

Background updates failed after upgrading from Synapse 1.25.0 #15677

jeeco5 opened this issue May 26, 2023 · 9 comments · Fixed by #15887
Assignees
Labels
O-Uncommon Most users are unlikely to come across this or unexpected workflow S-Minor Blocks non-critical functionality, workarounds exist. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues.

Comments

@jeeco5
Copy link

jeeco5 commented May 26, 2023

Description

I upgraded my old synapse server (v1.25.0) a few days ago to the latest version of synapse (1.84.0). Today I discovered that some background migrations failed (see log). If I understand this correctly some database schema migrations have failed. Is there any chance I can fix this errors without destroying the whole server?

Steps to reproduce

  • have an old synapse (v1.25)
  • try to upgrade it directly to (v1.84)

Homeserver

private

Synapse Version

1.84.0

Installation Method

Docker (matrixdotorg/synapse)

Database

PostgreSQL, no seperate servers

Workers

Single process

Platform

Debian VPS

Configuration

No response

Relevant log output

2023-05-26 15:56:58,235 - synapse.metrics.background_process_metrics - 244 - ERROR - background_updates-0 - Background process 'background_updates' threw an exception
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/synapse/storage/background_updates.py", line 294, in run_background_updates
    result = await self.do_next_background_update(sleep)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/synapse/storage/background_updates.py", line 424, in do_next_background_update
    await self._do_background_update(desired_duration_ms)
  File "/usr/local/lib/python3.11/site-packages/synapse/storage/background_updates.py", line 467, in _do_background_update
    items_updated = await update_handler(progress, batch_size)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/synapse/storage/databases/main/events_bg_updates.py", line 1313, in _background_replace_stream_ordering_column
    await self.db_pool.runInteraction(
  File "/usr/local/lib/python3.11/site-packages/synapse/storage/database.py", line 925, in runInteraction
    return await delay_cancellation(_runInteraction())
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/twisted/internet/defer.py", line 1693, in _inlineCallbacks
    result = context.run(
             ^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/twisted/python/failure.py", line 518, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/synapse/storage/database.py", line 891, in _runInteraction
    result = await self.runWithConnection(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/synapse/storage/database.py", line 1020, in runWithConnection
    return await make_deferred_yieldable(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/twisted/python/threadpool.py", line 244, in inContext
    result = inContext.theWork()  # type: ignore[attr-defined]
             ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/twisted/python/threadpool.py", line 260, in <lambda>
    inContext.theWork = lambda: context.call(  # type: ignore[attr-defined]
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/twisted/python/context.py", line 117, in callWithContext
    return self.currentContext().callWithContext(ctx, func, *args, **kw)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/twisted/python/context.py", line 82, in callWithContext
    return func(*args, **kw)
           ^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/twisted/enterprise/adbapi.py", line 282, in _runWithConnection
    result = func(conn, *args, **kw)
             ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/synapse/storage/database.py", line 1013, in inner_func
    return func(db_conn, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/synapse/storage/database.py", line 753, in new_transaction
    r = func(cursor, *args, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/synapse/storage/databases/main/events_bg_updates.py", line 1301, in process
    txn.execute(sql)
  File "/usr/local/lib/python3.11/site-packages/synapse/storage/database.py", line 417, in execute
    self._do_execute(self.txn.execute, sql, parameters)
  File "/usr/local/lib/python3.11/site-packages/synapse/storage/database.py", line 469, in _do_execute
    return func(sql, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
psycopg2.errors.DependentObjectsStillExist: cannot drop column stream_ordering of table events because other objects depend on it
DETAIL:  constraint event_stream_ordering_fkey on table current_state_events depends on column stream_ordering of table events
constraint event_stream_ordering_fkey on table local_current_membership depends on column stream_ordering of table events
constraint event_stream_ordering_fkey on table room_memberships depends on column stream_ordering of table events
HINT:  Use DROP ... CASCADE to drop the dependent objects too.


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/synapse/metrics/background_process_metrics.py", line 242, in run
    return await func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/synapse/storage/background_updates.py", line 299, in run_background_updates
    raise RuntimeError(
RuntimeError: 5 back-to-back background update failures; aborting.

Anything else that would be useful to know?

No response

@jeeco5
Copy link
Author

jeeco5 commented May 26, 2023

Maybe some more context after digging some more into this. I checked the following tables:

synapse=# select * from background_updates;
                 update_name                  |                                   progress_json                                   |              depends_on               | ordering 
----------------------------------------------+-----------------------------------------------------------------------------------+---------------------------------------+----------
 replace_stream_ordering_column               | {}                                                                                | index_stream_ordering2_ts             |     6001
 populate_room_depth_min_depth2               | {}                                                                                |                                       |     6103
 replace_room_depth_min_depth                 | {}                                                                                | populate_room_depth2                  |     6103
 remove_deleted_email_pushers                 | {}                                                                                |                                       |     6302
 populate_rooms_creator_column                | {}                                                                                |                                       |     6302
 presence_stream_not_offline_index            | {}                                                                                |                                       |     6304
 event_arbitrary_relations                    | {}                                                                                |                                       |     6507
 remove_dead_devices_from_device_inbox        | {}                                                                                |                                       |     6508
 delete_account_data_for_deactivated_users    | {}                                                                                |                                       |     6803
 cache_invalidation_index_by_instance         | {}                                                                                |                                       |     6902
 state_group_edges_unique_idx                 | {}                                                                                |                                       |     7008
 event_edges_drop_invalid_rows                | {}                                                                                |                                       |     7101
 event_edges_replace_index                    | {}                                                                                | event_edges_drop_invalid_rows         |     7101
 event_push_actions_stream_highlight_index    | {}                                                                                |                                       |     7202
 events_populate_state_key_rejections         | {"min_stream_ordering_exclusive": -526, "max_stream_ordering_inclusive": 1584233} |                                       |     7203
 event_push_summary_unique_index2             | {}                                                                                |                                       |     7006
 event_push_backfill_thread_id                | {}                                                                                | event_push_summary_unique_index2      |     7006
 receipts_linearized_unique_index             | {}                                                                                |                                       |     7007
 receipts_graph_unique_index                  | {}                                                                                |                                       |     7007
 event_push_actions_thread_id_null            | {}                                                                                | event_push_backfill_thread_id         |     7306
 event_push_summary_thread_id_null            | {}                                                                                | event_push_backfill_thread_id         |     7306
 threads_backfill                             | {}                                                                                |                                       |     7309
 device_lists_changes_in_room_by_room_index   | {}                                                                                |                                       |     7313
 populate_user_directory_createtables         | {}                                                                                |                                       |     7322
 populate_user_directory_process_rooms        | {}                                                                                | populate_user_directory_createtables  |     7322
 populate_user_directory_process_users        | {}                                                                                | populate_user_directory_process_rooms |     7322
 populate_user_directory_cleanup              | {}                                                                                | populate_user_directory_process_users |     7322
 populate_stats_process_rooms                 | {}                                                                                |                                       |     7322
 populate_stats_process_users                 | {}                                                                                |                                       |     7322
 events_jump_to_date_index                    | {}                                                                                |                                       |     7324
 set_device_id_for_pushers                    | {}                                                                                |                                       |     7402
 room_membership_user_room_index              | {}                                                                                |                                       |     7403
 delete_e2e_backup_keys_for_deactivated_users | {}                                                                                |                                       |     7404
 profiles_full_user_id_key_idx                | {}                                                                                |                                       |     7501
 full_users_filters_unique_idx                | {}                                                                                |                                       |     7502
(35 rows)

synapse=# select * from schema_version;
 lock | version | upgraded 
------+---------+----------
 X    |      76 | t
(1 row)

synapse=# select * from schema_compat_version;
 lock | compat_version 
------+----------------
 X    |             74
(1 row)

After digging some more into this I noticed that the constraints which are preventing the column from being deleted are added in schema version 7403 while the error is from schema version 6001. So I think the ordering during the updates was not correct.

In a cloned setup I tried deleting the constraints from 7403 manually by running this:

synapse=# ALTER TABLE current_state_events DROP CONSTRAINT event_stream_ordering_fkey;
ALTER TABLE
synapse=# ALTER TABLE local_current_membership DROP CONSTRAINT event_stream_ordering_fkey;
ALTER TABLE
synapse=# ALTER TABLE room_memberships DROP CONSTRAINT event_stream_ordering_fkey;
ALTER TABLE

After starting synapse again the migration continues and finished without any more errors, but it took over an hour. At the end I tried adding the constraints again, which also worked:

synapse=# ALTER TABLE current_state_events ADD CONSTRAINT event_stream_ordering_fkey FOREIGN KEY (event_stream_ordering) REFERENCES events(stream_ordering) NOT VALID;
ALTER TABLE
synapse=# ALTER TABLE local_current_membership ADD CONSTRAINT event_stream_ordering_fkey FOREIGN KEY (event_stream_ordering) REFERENCES events(stream_ordering) NOT VALID;
ALTER TABLE
synapse=# ALTER TABLE room_memberships ADD CONSTRAINT event_stream_ordering_fkey FOREIGN KEY (event_stream_ordering) REFERENCES events(stream_ordering) NOT VALID;
ALTER TABLE

So I have a few questions for this: Is this a valid solution and is it save to do the migrations like that or did I made the whole setup much worse and should do something else to fix it?

@H-Shay
Copy link
Contributor

H-Shay commented May 30, 2023

Hello - it sounds like you are requesting support rather than reporting a bug. I am going to direct you to #synapse:matrix.org where you can ask questions and get support.

@H-Shay H-Shay closed this as completed May 30, 2023
@clokep
Copy link
Member

clokep commented May 30, 2023

Hello - it sounds like you are requesting support rather than reporting a bug. I am going to direct you to #synapse:matrix.org where you can ask questions and get support.

I disagree -- this very much sounds like we have a bug somewhere in our background updates and we should investigate further: it sounds like there might be a dependency. missing between background updates?

@clokep clokep reopened this May 30, 2023
@H-Shay
Copy link
Contributor

H-Shay commented May 30, 2023

Got it I will take a deeper look.

@H-Shay
Copy link
Contributor

H-Shay commented May 30, 2023

AFAICT, this looks like an interaction between the application of schema delta files and a background update. As the reporter noted, the schema delta file 60/01recreate_stream_ordering.sql.postgres kicks off a background update replace_stream_ordering_column which drops the stream_ordering column on the events table as part of recreating it:

"ALTER TABLE events DROP COLUMN stream_ordering",
"ALTER TABLE events RENAME COLUMN stream_ordering2 TO stream_ordering",

Meanwhile schema delta 74/03_membership_tables_event_stream_ordering.sql.postgres adds some constraints on the tables current_state_events, local_current_membership, and room_memberships which reference the stream_ordering column of events:

ALTER TABLE current_state_events ADD COLUMN event_stream_ordering BIGINT;
ALTER TABLE current_state_events ADD CONSTRAINT event_stream_ordering_fkey FOREIGN KEY (event_stream_ordering) REFERENCES events(stream_ordering) NOT VALID;
ALTER TABLE local_current_membership ADD COLUMN event_stream_ordering BIGINT;
ALTER TABLE local_current_membership ADD CONSTRAINT event_stream_ordering_fkey FOREIGN KEY (event_stream_ordering) REFERENCES events(stream_ordering) NOT VALID;
ALTER TABLE room_memberships ADD COLUMN event_stream_ordering BIGINT;
ALTER TABLE room_memberships ADD CONSTRAINT event_stream_ordering_fkey FOREIGN KEY (event_stream_ordering) REFERENCES events(stream_ordering) NOT VALID;

Schema deltas are applied when we are preparing the database, so my best guess is that since the second delta in question was not a background update it was applied immediately, adding the constraints referencing the stream_ordering column before the background updates had processed up to the update replace_stream_ordering_column, which then failed because the column could not be dropped due to the constraints referencing it.

I wonder if this could be fixed by changing the delta in 74/03_membership_tables_event_stream_ordering.sql.postgres from a foreground update to a background update and having it depend on the completion of the background update replace_stream_ordering_column (the update that was added in 60/01recreate_stream_ordering.sql.postgres).

As for the question:

Is this a valid solution and is it save to do the migrations like that or did I made the whole setup much worse and should do something else to fix it?

Dropping the offending constraints, running the rest of the migration and then re-adding the constraints back in doesn't sound unreasonable to me, but I will ask @clokep to confirm that this is a safe action as we generally do not advise admins operate on the database directly.

@MatMaul
Copy link
Contributor

MatMaul commented Jun 2, 2023

Dropping the offending constraints, running the rest of the migration and then re-adding the constraints back in doesn't sound unreasonable to me, but I will ask @clokep to confirm that this is a safe action as we generally do not advise admins operate on the database directly.

It looks legit to me.

As for every manual database action PLEASE PLEASE PLEASE backup your DB before trying that. Thanks !

@MatMaul
Copy link
Contributor

MatMaul commented Jun 2, 2023

I wonder if this could be fixed by changing the delta in 74/03_membership_tables_event_stream_ordering.sql.postgres from a foreground update to a background update and having it depend on the completion of the background update replace_stream_ordering_column (the update that was added in 60/01recreate_stream_ordering.sql.postgres).

This looks like a proper solution, but I am unsure that we have an easy way to express dependency between bg update jobs. I had a quick look but couldn't find an example, but I didn't review all of them there are a bunch.

A way to do it would be to busy wait for the bg update to be marked complete in the DB table for bg updates.

@H-Shay
Copy link
Contributor

H-Shay commented Jun 2, 2023

It looks like when inserting a background job in to the DB you can specify another background job that must finish before the one you are inserting can be run, like so:

INSERT INTO background_updates (update_name, progress_json) VALUES
('populate_stats_createtables', '{}');
-- Run through each room and update stats
INSERT INTO background_updates (update_name, progress_json, depends_on) VALUES
('populate_stats_process_rooms', '{}', 'populate_stats_createtables');

This was what I was thinking of.

@H-Shay H-Shay self-assigned this Jun 5, 2023
@erikjohnston erikjohnston added S-Minor Blocks non-critical functionality, workarounds exist. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. O-Uncommon Most users are unlikely to come across this or unexpected workflow labels Jun 6, 2023
@H-Shay
Copy link
Contributor

H-Shay commented Jun 6, 2023

@Fizzadar pinging you here as this issue involves code that was submitted by you - I've suggested a fix for this at #15724 if you'd be willing to take a look and make sure it is congruent with what you were trying to achieve originally.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
O-Uncommon Most users are unlikely to come across this or unexpected workflow S-Minor Blocks non-critical functionality, workarounds exist. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues.
Projects
None yet
5 participants