-
Notifications
You must be signed in to change notification settings - Fork 188
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
In sync wait for worker to catch up since token #17215
Conversation
Otherwise things will get confused.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks reasonable, maybe something better could be done about the sleeps but I agree with the 'TODO be better' :D
Oh bleurgh, the typing stream can legitimately go backwards 🙄 |
somewhat-flakey Complement failure is because the events stream (and possibly another) is going backwards. I don't know why this happens but we've noticed it before in Complement.
as an example of where it gets stuck, when locally enhanced with logging |
b1300c4
to
a84615e
Compare
Right, so the problem is that there are occasions where we ask for a new stream ID from the generator, but then don't persist it in the DB. If this happens before a restart then the This doesn't affect I'm not quite sure what the right way forward is her here, ideally we'd make it so that stream IDs must be used, but that is hard. |
Actually, maybe we use |
|
There is a problem with `StreamIdGenerator` where it can go backwards over restarts when a stream ID is requested but then not inserted into the DB. This is problematic if we want to land #17215, and is generally a potential cause for all sorts of nastiness. Instead of trying to fix `StreamIdGenerator`, we may as well move to `MultiWriterIdGenerator` that does not suffer from this problem (the latest positions are stored in `stream_positions` table). This involves adding SQLite support to the class. This only changes id generators that were already using `MultiWriterIdGenerator` under postgres, a separate PR will move the rest of the uses of `StreamIdGenerator` over.
) There is a problem with `StreamIdGenerator` where it can go backwards over restarts when a stream ID is requested but then not inserted into the DB. This is problematic if we want to land element-hq#17215, and is generally a potential cause for all sorts of nastiness. Instead of trying to fix `StreamIdGenerator`, we may as well move to `MultiWriterIdGenerator` that does not suffer from this problem (the latest positions are stored in `stream_positions` table). This involves adding SQLite support to the class. This only changes id generators that were already using `MultiWriterIdGenerator` under postgres, a separate PR will move the rest of the uses of `StreamIdGenerator` over.
Otherwise things will get confused. An alternative would be to make sure that for lagging stream we don't return anything (and make sure the returned next_batch token doesn't go backwards). But that is a faff.
) There is a problem with `StreamIdGenerator` where it can go backwards over restarts when a stream ID is requested but then not inserted into the DB. This is problematic if we want to land element-hq#17215, and is generally a potential cause for all sorts of nastiness. Instead of trying to fix `StreamIdGenerator`, we may as well move to `MultiWriterIdGenerator` that does not suffer from this problem (the latest positions are stored in `stream_positions` table). This involves adding SQLite support to the class. This only changes id generators that were already using `MultiWriterIdGenerator` under postgres, a separate PR will move the rest of the uses of `StreamIdGenerator` over.
Otherwise things will get confused. An alternative would be to make sure that for lagging stream we don't return anything (and make sure the returned next_batch token doesn't go backwards). But that is a faff.
- Fix the building of binary wheels for macOS by switching to macOS 12 CI runners. ([\#17319](element-hq/synapse#17319)) - When rolling back to a previous Synapse version and then forwards again to this release, don't require server operators to manually run SQL. ([\#17305](element-hq/synapse#17305), [\#17309](element-hq/synapse#17309)) - Use the release branch for sytest in release-branch PRs. ([\#17306](element-hq/synapse#17306)) - Fix bug where one-time-keys were not always included in `/sync` response when using workers. Introduced in v1.109.0rc1. ([\#17275](element-hq/synapse#17275)) - Fix bug where `/sync` could get stuck due to edge case in device lists handling. Introduced in v1.109.0rc1. ([\#17292](element-hq/synapse#17292)) - Add the ability to auto-accept invites on the behalf of users. See the [`auto_accept_invites`](https://element-hq.github.io/synapse/latest/usage/configuration/config_documentation.html#auto-accept-invites) config option for details. ([\#17147](element-hq/synapse#17147)) - Add experimental [MSC3575](matrix-org/matrix-spec-proposals#3575) Sliding Sync `/sync/e2ee` endpoint for to-device messages and device encryption info. ([\#17167](element-hq/synapse#17167)) - Support [MSC3916](matrix-org/matrix-spec-proposals#3916) by adding unstable media endpoints to `/_matrix/client`. ([\#17213](element-hq/synapse#17213)) - Add logging to tasks managed by the task scheduler, showing CPU and database usage. ([\#17219](element-hq/synapse#17219)) - Fix deduplicating of membership events to not create unused state groups. ([\#17164](element-hq/synapse#17164)) - Fix bug where duplicate events could be sent down sync when using workers that are overloaded. ([\#17215](element-hq/synapse#17215)) - Ignore attempts to send to-device messages to bad users, to avoid log spam when we try to connect to the bad server. ([\#17240](element-hq/synapse#17240)) - Fix handling of duplicate concurrent uploading of device one-time-keys. ([\#17241](element-hq/synapse#17241)) - Fix reporting of default tags to Sentry, such as worker name. Broke in v1.108.0. ([\#17251](element-hq/synapse#17251)) - Fix bug where typing updates would not be sent when using workers after a restart. ([\#17252](element-hq/synapse#17252)) - Update the LemonLDAP documentation to say that claims should be explicitly included in the returned `id_token`, as Synapse won't request them. ([\#17204](element-hq/synapse#17204)) - Improve DB usage when fetching related events. ([\#17083](element-hq/synapse#17083)) - Log exceptions when failing to auto-join new user according to the `auto_join_rooms` option. ([\#17176](element-hq/synapse#17176)) - Reduce work of calculating outbound device lists updates. ([\#17211](element-hq/synapse#17211)) - Improve performance of calculating device lists changes in `/sync`. ([\#17216](element-hq/synapse#17216)) - Move towards using `MultiWriterIdGenerator` everywhere. ([\#17226](element-hq/synapse#17226)) - Replaces all usages of `StreamIdGenerator` with `MultiWriterIdGenerator`. ([\#17229](element-hq/synapse#17229)) - Change the `allow_unsafe_locale` config option to also apply when setting up new databases. ([\#17238](element-hq/synapse#17238)) - Fix errors in logs about closing incorrect logging contexts when media gets rejected by a module. ([\#17239](element-hq/synapse#17239), [\#17246](element-hq/synapse#17246)) - Clean out invalid destinations from `device_federation_outbox` table. ([\#17242](element-hq/synapse#17242)) - Stop logging errors when receiving invalid User IDs in key querys requests. ([\#17250](element-hq/synapse#17250)) * Bump anyhow from 1.0.83 to 1.0.86. ([\#17220](element-hq/synapse#17220)) * Bump bcrypt from 4.1.2 to 4.1.3. ([\#17224](element-hq/synapse#17224)) * Bump lxml from 5.2.1 to 5.2.2. ([\#17261](element-hq/synapse#17261)) * Bump mypy-zope from 1.0.3 to 1.0.4. ([\#17262](element-hq/synapse#17262)) * Bump phonenumbers from 8.13.35 to 8.13.37. ([\#17235](element-hq/synapse#17235)) * Bump prometheus-client from 0.19.0 to 0.20.0. ([\#17233](element-hq/synapse#17233)) * Bump pyasn1 from 0.5.1 to 0.6.0. ([\#17223](element-hq/synapse#17223)) * Bump pyicu from 2.13 to 2.13.1. ([\#17236](element-hq/synapse#17236)) * Bump pyopenssl from 24.0.0 to 24.1.0. ([\#17234](element-hq/synapse#17234)) * Bump serde from 1.0.201 to 1.0.202. ([\#17221](element-hq/synapse#17221)) * Bump serde from 1.0.202 to 1.0.203. ([\#17232](element-hq/synapse#17232)) * Bump twine from 5.0.0 to 5.1.0. ([\#17225](element-hq/synapse#17225)) * Bump types-psycopg2 from 2.9.21.20240311 to 2.9.21.20240417. ([\#17222](element-hq/synapse#17222)) * Bump types-pyopenssl from 24.0.0.20240311 to 24.1.0.20240425. ([\#17260](element-hq/synapse#17260))
Fixes #17274, hopefully. Basically, old versions of Synapse could advance streams without persisting anything in the DB (fixed in #17229). On restart those updates would get lost, and so the position of the stream would revert to an older position. If this happened across an upgrade to a later Synapse version which included #17215, then sync could get blocked indefinitely (until the stream advanced to the position in the token). We fix this by bounding the stream positions we'll wait for to the maximum position of the underlying stream ID generator.
Fixes #17274, hopefully. Basically, old versions of Synapse could advance streams without persisting anything in the DB (fixed in #17229). On restart those updates would get lost, and so the position of the stream would revert to an older position. If this happened across an upgrade to a later Synapse version which included #17215, then sync could get blocked indefinitely (until the stream advanced to the position in the token). We fix this by bounding the stream positions we'll wait for to the maximum position of the underlying stream ID generator.
…ced in v1.109.0. (#17428) Introduced in: #17215 This caused us a minor bit of grief as the volume of logs produced was much higher than normal --------- Signed-off-by: Olivier 'reivilibre <[email protected]>
Otherwise things will get confused.
An alternative would be to make sure that for lagging stream we don't return anything (and make sure the returned next_batch token doesn't go backwards). But that is a faff.