Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Stabilize support for MSC3970: updated transaction semantics (scope to device_id) #15629

Merged
merged 15 commits into from
Aug 4, 2023

Conversation

clokep
Copy link
Member

@clokep clokep commented May 18, 2023

MSC3970 passed FCP, we can rip out the unstable config for it now.

@clokep
Copy link
Member Author

clokep commented May 18, 2023

CC @sandhose & @hughns

synapse/events/utils.py Show resolved Hide resolved
synapse/handlers/room_member.py Outdated Show resolved Hide resolved
synapse/rest/client/transactions.py Outdated Show resolved Hide resolved
synapse/storage/databases/main/events.py Show resolved Hide resolved
@sandhose
Copy link
Member

The complement tests are failing because we explicitly added tests to test the pre-MSC3970 behaviour, and there are post-MSC3970 tests ready under a build flag

@clokep
Copy link
Member Author

clokep commented May 23, 2023

The complement tests are failing because we explicitly added tests to test the pre-MSC3970 behaviour, and there are post-MSC3970 tests ready under a build flag

I'm a bit confused about this, I see:

@sandhose Can you help me untangle this? From my reading of the above, I'm surprised anything is failing with this PR?

@clokep clokep mentioned this pull request May 23, 2023
12 tasks
@hughns
Copy link
Member

hughns commented May 26, 2023

@clokep I spoke with Q earlier and made a plan to untangle this.

We now have a single Complement PR, matrix-org/complement#637, which shows the appropriate changes to assert the new transaction ID behaviour.

However, as noted in comments in the PR the tests are currently skipped on Synapse and Dendrite due to neither supporting the new transaction behaviour yet.

How should this be handled? Effectively both this PR and the one in Complement should be merged at the same time.

(As discussed in the MSC, this is technically a breaking change hence the cyclical dependency)

@clokep
Copy link
Member Author

clokep commented May 26, 2023

However, as noted in comments in the PR the tests are currently skipped on Synapse and Dendrite due to neither supporting the new transaction behaviour yet.

How should this be handled? Effectively both this PR and the one in Complement should be merged at the same time.

I repushed matrix-org/complement#637 as clokep/stable-txn and added a commit to remove the test skipping for Synapse (matrix-org/complement@768f359) so that tests will run in CI via branch matching.

@clokep clokep changed the title Stabilize support for MSC3970. Stabilize support for MSC3970: updated transaction semantics May 26, 2023
@clokep clokep marked this pull request as ready for review May 26, 2023 19:11
@clokep clokep requested a review from a team as a code owner May 26, 2023 19:11
Copy link
Contributor

@reivilibre reivilibre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing major; happy to leave any minor comment changes in your hands

synapse/storage/databases/main/events.py Outdated Show resolved Hide resolved
Comment on lines 1015 to 1016
# TODO Remove this once Synapse cannot be rolled back to a version without
# device IDs being stored.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a target version of SCHEMA_COMPAT_VERSION you need? If you can find out, might be worth writing it here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Errr, so right now we write to both event_txn_id and event_txn_id_device_id. I think we want to leave this as-is so we can rollback if needed.

I guess in the future we would want to:

  • Wait
  • Stop writing (and reading to) to event_txn_id and bump the SCHEMA_VERSION
  • Wait
  • Bump the SCHEMA_COMPAT_VERSION, rip out the code

Does that sound like the right series of steps?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We talked about this a bit: https://matrix.to/#/!vcyiEtMVHIhWXcJAfl:sw1v.org/$WQwonqO7etYpa0ik4aVzVsf32R_VOqIyPzBjTl3pjps?via=matrix.org&via=element.io&via=beeper.com

My tl;dr is that the above steps look reasonable, although you usually want to stop reading before writing. Unfortunately we do still use this for certain situations, e.g. appservice users don't have device IDs. So we would need to figure out what to do there?

My inclination is to merge this as-is (so as to not block Matrix 1.7 support) and file a follow-up.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately we do still use this for certain situations, e.g. appservice users don't have device IDs. So we would need to figure out what to do there?

Probably good to add a comment to the diff on the caveat for why the access token fallback needs to stay until that is figured out

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I was planning to point to an issue. Was waiting to see if others had thoughts before clicking merge.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Taking a step back and re-evaluating my other threads with @sandhose (and re-reading the comments in this PR) I don't actually think we can remove this since it gets used for guests, appservice users, and access tokens from the admin API. 😢

I've made two changes:

  • Consolidate some logic so we have fewer places we attempt to line up an event from a transaction ID.
  • Added a lot more comments.

Could folks (in particular @sandhose) let me know if these still read OK? I feel like I'm missing something, but I've tried to include my current understanding in the comments.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The thing is, guests and appservice users don't have access token IDs anyway, so it basically won't do anything if we keep it. The only thing left is admin-minted, device-less tokens.

Keep in mind, there are two layers of things that ensure idempotency:

  • the in-memory HTTP transaction cache, which properly covers guests, appservices and device-less tokens
  • the db-persisted (for 24h) transaction_id -> event_id mapping, which only worked for access tokens with IDs in it, so no guests and no appservices

Also, this is only for idempotency, the transaction ID echo in event serialisation is handled by the event internal metadata JSON blob, which is separate from those tables.

If I'm not mistaken, I think this would only affect

  • device-less real users
  • when they replay a request, either load-balanced to another worker, or after a Synapse restart

I don't mind keeping this, but also I don't think it's worth keeping it around.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, this is only for idempotency, the transaction ID echo in event serialisation is handled by the event internal metadata JSON blob, which is separate from those tables.

And I suppose the same rules apply to the event internal metadata info -- guests and appservices users don't have access token IDs so we can't be storing it there anyway.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking through this a bit more I think we need two separate approaches:

Database transaction mapping

This describes what to do with "the db-persisted (for 24h) transaction_id -> event_id mapping".

This only survives for 24 hours, but that's not really relevant to the conversation of rolling back (or forwards). As background, there's two tables (event_txn_id and event_txn_id_device_id) for access token IDs and device IDs, respectively. To gracefully handle up/downgrades I propose:

  1. (This PR = N) Always write to both and read data from either (first event_txn_id_device_id, fallback to event_txn_id). Bump the schema version to V.
  2. In a future PR (N+2 versions), stop reading and writing to event_txn_id, bump the schema version to V+1.

At this point we have a database schema still compatible with today, but we're not using data event_txn_id, so rolling back to e.g. N is fine, but not before N. To enforce this, bump the minimum schema version to V.

  1. In a future PR (N+4 versions), add a schema delta which drops the event_txn_id table.

At this point we no longer have a database schema compatible with N, but we can still rollback OK to N+2. So bump the minimum schema version to V+1.

Internal metadata

The device ID & access token ID are unconditionally written into the internal metadata of events. It then follows similar logic where the transaction ID is returned to a user when serializing the event if:

  1. The device ID exists in the internal metadata and matches that of the sender. 1
  2. The requester & sender user IDs match and:
    • The token ID exists in the internal metadata and matches that of the sender.
    • The requester is a guest.
    • The requester is an appservice.

A similar thing should be followed:

  1. (This PR = N) Start using the device ID in-lieu of the token ID, if it is set. (Note that both are currently saved in the internal metadata.)
  2. In a future PR (N+2), stop writing the token ID into the internal metadata.

This implies that we would leave the logic for the fallback "forever", but I think we're already at am minimum schema version where we've always written both values, so we only need to be concerned with old data.

An option here would be to do a data migration for any event who has a token ID which still exists in the database, add the device ID. Then remove the fallback code. This seems a bit onerous though.

Footnotes

  1. Device IDs matching should also be gated by the user ID, I think. There's more risk of those overlapping that the token IDs since they're client controllable.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#16042 was filed to handle follow-up.

@MadLittleMods MadLittleMods changed the title Stabilize support for MSC3970: updated transaction semantics Stabilize support for MSC3970: updated transaction semantics (scope to device_id) May 31, 2023
Copy link
Member

@sandhose sandhose left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good, we just need to decide whether we should keep the (txn_id, token_id) -> event_id mapping in DB or not

synapse/storage/databases/main/events.py Show resolved Hide resolved
Comment on lines 1015 to 1016
# TODO Remove this once Synapse cannot be rolled back to a version without
# device IDs being stored.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The thing is, guests and appservice users don't have access token IDs anyway, so it basically won't do anything if we keep it. The only thing left is admin-minted, device-less tokens.

Keep in mind, there are two layers of things that ensure idempotency:

  • the in-memory HTTP transaction cache, which properly covers guests, appservices and device-less tokens
  • the db-persisted (for 24h) transaction_id -> event_id mapping, which only worked for access tokens with IDs in it, so no guests and no appservices

Also, this is only for idempotency, the transaction ID echo in event serialisation is handled by the event internal metadata JSON blob, which is separate from those tables.

If I'm not mistaken, I think this would only affect

  • device-less real users
  • when they replay a request, either load-balanced to another worker, or after a Synapse restart

I don't mind keeping this, but also I don't think it's worth keeping it around.

@clokep clokep requested review from reivilibre and a team August 1, 2023 12:38
@clokep
Copy link
Member Author

clokep commented Aug 1, 2023

I've put this back into the review queue -- I split some changes out and made a few additional changes to this PR. See above for an explanation of future work / how the database changes will work.

@clokep clokep merged commit d98a43d into develop Aug 4, 2023
37 checks passed
@clokep clokep deleted the clokep/stable-txn branch August 4, 2023 11:47
realtyem added a commit to realtyem/synapse-old that referenced this pull request Aug 6, 2023
yingziwu added a commit to yingziwu/synapse that referenced this pull request Aug 21, 2023
No significant changes since 1.90.0rc1.

- Scope transaction IDs to devices (implement [MSC3970](matrix-org/matrix-spec-proposals#3970)). ([\matrix-org#15629](matrix-org#15629))
- Remove old rows from the `cache_invalidation_stream_by_instance` table automatically (this table is unused in SQLite). ([\matrix-org#15868](matrix-org#15868))

- Fix a long-standing bug where purging history and paginating simultaneously could lead to database corruption when using workers. ([\matrix-org#15791](matrix-org#15791))
- Fix a long-standing bug where profile endpoint returned a 404 when the user's display name was empty. ([\matrix-org#16012](matrix-org#16012))
- Fix a long-standing bug where the `synapse_port_db` failed to configure sequences for application services and partial stated rooms. ([\matrix-org#16043](matrix-org#16043))
- Fix long-standing bug with deletion in dehydrated devices v2. ([\matrix-org#16046](matrix-org#16046))

- Add `org.opencontainers.image.version` labels to Docker containers [published by Matrix.org](https://hub.docker.com/r/matrixdotorg/synapse). Contributed by Mo Balaa. ([\matrix-org#15972](matrix-org#15972), [\matrix-org#16009](matrix-org#16009))

- Add a internal documentation page describing the ["streams" used within Synapse](https://matrix-org.github.io/synapse/v1.90/development/synapse_architecture/streams.html). ([\matrix-org#16015](matrix-org#16015))
- Clarify comment on the keys/upload over replication enpoint. ([\matrix-org#16016](matrix-org#16016))
- Do not expose Admin API in caddy reverse proxy example. Contributed by @NilsIrl. ([\matrix-org#16027](matrix-org#16027))

- Remove support for legacy application service paths. ([\matrix-org#15964](matrix-org#15964))
- Move support for application service query parameter authorization behind a configuration option. ([\matrix-org#16017](matrix-org#16017))

- Update SQL queries to inline boolean parameters as supported in SQLite 3.27. ([\matrix-org#15525](matrix-org#15525))
- Allow for the configuration of the backoff algorithm for federation destinations. ([\matrix-org#15754](matrix-org#15754))
- Allow modules to check whether the current worker is configured to run background tasks. ([\matrix-org#15991](matrix-org#15991))
- Update support for [MSC3958](matrix-org/matrix-spec-proposals#3958) to match the latest revision of the MSC. ([\matrix-org#15992](matrix-org#15992))
- Allow modules to schedule delayed background calls. ([\matrix-org#15993](matrix-org#15993))
- Properly overwrite the `redacts` content-property for forwards-compatibility with room versions 1 through 10. ([\matrix-org#16013](matrix-org#16013))
- Fix building the nix development environment on MacOS systems. ([\matrix-org#16019](matrix-org#16019))
- Remove leading and trailing spaces when setting a display name. ([\matrix-org#16031](matrix-org#16031))
- Combine duplicated code. ([\matrix-org#16023](matrix-org#16023))
- Collect additional metrics from `ResponseCache` for eviction. ([\matrix-org#16028](matrix-org#16028))
- Fix endpoint improperly declaring support for MSC3814. ([\matrix-org#16068](matrix-org#16068))
- Drop backwards compat hack for event serialization. ([\matrix-org#16069](matrix-org#16069))

* Update PyYAML to 6.0.1. ([\matrix-org#16011](matrix-org#16011))
* Bump cryptography from 41.0.2 to 41.0.3. ([\matrix-org#16048](matrix-org#16048))
* Bump furo from 2023.5.20 to 2023.7.26. ([\matrix-org#16077](matrix-org#16077))
* Bump immutabledict from 2.2.4 to 3.0.0. ([\matrix-org#16034](matrix-org#16034))
* Update certifi to 2023.7.22 and pygments to 2.15.1. ([\matrix-org#16044](matrix-org#16044))
* Bump jsonschema from 4.18.3 to 4.19.0. ([\matrix-org#16081](matrix-org#16081))
* Bump phonenumbers from 8.13.14 to 8.13.18. ([\matrix-org#16076](matrix-org#16076))
* Bump regex from 1.9.1 to 1.9.3. ([\matrix-org#16073](matrix-org#16073))
* Bump serde from 1.0.171 to 1.0.175. ([\matrix-org#15982](matrix-org#15982))
* Bump serde from 1.0.175 to 1.0.179. ([\matrix-org#16033](matrix-org#16033))
* Bump serde from 1.0.179 to 1.0.183. ([\matrix-org#16074](matrix-org#16074))
* Bump serde_json from 1.0.103 to 1.0.104. ([\matrix-org#16032](matrix-org#16032))
* Bump service-identity from 21.1.0 to 23.1.0. ([\matrix-org#16038](matrix-org#16038))
* Bump types-commonmark from 0.9.2.3 to 0.9.2.4. ([\matrix-org#16037](matrix-org#16037))
* Bump types-jsonschema from 4.17.0.8 to 4.17.0.10. ([\matrix-org#16036](matrix-org#16036))
* Bump types-netaddr from 0.8.0.8 to 0.8.0.9. ([\matrix-org#16035](matrix-org#16035))
* Bump types-opentracing from 2.4.10.5 to 2.4.10.6. ([\matrix-org#16078](matrix-org#16078))
* Bump types-setuptools from 68.0.0.0 to 68.0.0.3. ([\matrix-org#16079](matrix-org#16079))
Fizzadar added a commit to beeper/synapse-legacy-fork that referenced this pull request Sep 18, 2023
No significant changes since 1.90.0rc1.

- Scope transaction IDs to devices (implement [MSC3970](matrix-org/matrix-spec-proposals#3970)). ([\matrix-org#15629](matrix-org#15629))
- Remove old rows from the `cache_invalidation_stream_by_instance` table automatically (this table is unused in SQLite). ([\matrix-org#15868](matrix-org#15868))

- Fix a long-standing bug where purging history and paginating simultaneously could lead to database corruption when using workers. ([\matrix-org#15791](matrix-org#15791))
- Fix a long-standing bug where profile endpoint returned a 404 when the user's display name was empty. ([\matrix-org#16012](matrix-org#16012))
- Fix a long-standing bug where the `synapse_port_db` failed to configure sequences for application services and partial stated rooms. ([\matrix-org#16043](matrix-org#16043))
- Fix long-standing bug with deletion in dehydrated devices v2. ([\matrix-org#16046](matrix-org#16046))

- Add `org.opencontainers.image.version` labels to Docker containers [published by Matrix.org](https://hub.docker.com/r/matrixdotorg/synapse). Contributed by Mo Balaa. ([\matrix-org#15972](matrix-org#15972), [\matrix-org#16009](matrix-org#16009))

- Add a internal documentation page describing the ["streams" used within Synapse](https://matrix-org.github.io/synapse/v1.90/development/synapse_architecture/streams.html). ([\matrix-org#16015](matrix-org#16015))
- Clarify comment on the keys/upload over replication enpoint. ([\matrix-org#16016](matrix-org#16016))
- Do not expose Admin API in caddy reverse proxy example. Contributed by @NilsIrl. ([\matrix-org#16027](matrix-org#16027))

- Remove support for legacy application service paths. ([\matrix-org#15964](matrix-org#15964))
- Move support for application service query parameter authorization behind a configuration option. ([\matrix-org#16017](matrix-org#16017))

- Update SQL queries to inline boolean parameters as supported in SQLite 3.27. ([\matrix-org#15525](matrix-org#15525))
- Allow for the configuration of the backoff algorithm for federation destinations. ([\matrix-org#15754](matrix-org#15754))
- Allow modules to check whether the current worker is configured to run background tasks. ([\matrix-org#15991](matrix-org#15991))
- Update support for [MSC3958](matrix-org/matrix-spec-proposals#3958) to match the latest revision of the MSC. ([\matrix-org#15992](matrix-org#15992))
- Allow modules to schedule delayed background calls. ([\matrix-org#15993](matrix-org#15993))
- Properly overwrite the `redacts` content-property for forwards-compatibility with room versions 1 through 10. ([\matrix-org#16013](matrix-org#16013))
- Fix building the nix development environment on MacOS systems. ([\matrix-org#16019](matrix-org#16019))
- Remove leading and trailing spaces when setting a display name. ([\matrix-org#16031](matrix-org#16031))
- Combine duplicated code. ([\matrix-org#16023](matrix-org#16023))
- Collect additional metrics from `ResponseCache` for eviction. ([\matrix-org#16028](matrix-org#16028))
- Fix endpoint improperly declaring support for MSC3814. ([\matrix-org#16068](matrix-org#16068))
- Drop backwards compat hack for event serialization. ([\matrix-org#16069](matrix-org#16069))

* Update PyYAML to 6.0.1. ([\matrix-org#16011](matrix-org#16011))
* Bump cryptography from 41.0.2 to 41.0.3. ([\matrix-org#16048](matrix-org#16048))
* Bump furo from 2023.5.20 to 2023.7.26. ([\matrix-org#16077](matrix-org#16077))
* Bump immutabledict from 2.2.4 to 3.0.0. ([\matrix-org#16034](matrix-org#16034))
* Update certifi to 2023.7.22 and pygments to 2.15.1. ([\matrix-org#16044](matrix-org#16044))
* Bump jsonschema from 4.18.3 to 4.19.0. ([\matrix-org#16081](matrix-org#16081))
* Bump phonenumbers from 8.13.14 to 8.13.18. ([\matrix-org#16076](matrix-org#16076))
* Bump regex from 1.9.1 to 1.9.3. ([\matrix-org#16073](matrix-org#16073))
* Bump serde from 1.0.171 to 1.0.175. ([\matrix-org#15982](matrix-org#15982))
* Bump serde from 1.0.175 to 1.0.179. ([\matrix-org#16033](matrix-org#16033))
* Bump serde from 1.0.179 to 1.0.183. ([\matrix-org#16074](matrix-org#16074))
* Bump serde_json from 1.0.103 to 1.0.104. ([\matrix-org#16032](matrix-org#16032))
* Bump service-identity from 21.1.0 to 23.1.0. ([\matrix-org#16038](matrix-org#16038))
* Bump types-commonmark from 0.9.2.3 to 0.9.2.4. ([\matrix-org#16037](matrix-org#16037))
* Bump types-jsonschema from 4.17.0.8 to 4.17.0.10. ([\matrix-org#16036](matrix-org#16036))
* Bump types-netaddr from 0.8.0.8 to 0.8.0.9. ([\matrix-org#16035](matrix-org#16035))
* Bump types-opentracing from 2.4.10.5 to 2.4.10.6. ([\matrix-org#16078](matrix-org#16078))
* Bump types-setuptools from 68.0.0.0 to 68.0.0.3. ([\matrix-org#16079](matrix-org#16079))

# -----BEGIN PGP SIGNATURE-----
#
# iQIzBAABCAAdFiEE8SRSDO7gYkSP4chELS76LzL74EcFAmTbUOEACgkQLS76LzL7
# 4Efskw/+J4X30PoqSvBWbilr8kTzwNSDXrkefYXR2sLburgowCyuAtKtCdbvnZUX
# 3KRwii5/GDsduXiNY836oRxO/KWE43b1ce9C9qJM7V6NmgkJBgHRvnh69wdlmBqt
# 6b6TQHoEByYS7yK90+QsRm1Bqrw7eoVO9oxcZ+4lb7Mjswf491Pga8kFJqdvjtTX
# UXo4vAqYyP6Yn7sUrQmXy0N8gZ5ZFHhZEvZZ8+iEsNjPO468cSVGq8/iPB1EwBm2
# nbfZWMDnD2p7plJezXOPEBxnVR3RrWbCbK08SiiNMcQynCvBgAUfkd3GnsO726jb
# 19i8p6tjuj2r41UgqYCTG2i2ij6uJquA/qq3rIiVNQVKG9aPHQ8hJfu9XOdEvaJe
# Je9H6QFrU/hR640tFvb5Hdc/4ThabvtC5xgl4ZGT6y6I0s5LNwk8fJiz3sDFt0i1
# XKsqGVemBGopZnwjQIPFJaHjPT7of33PXLE/hf1vX+oXU/6MNbFYkDLY9nnnQeOx
# 0GEbiYaxrj8SfxNmEykMLNCfxwJ719cSR1q8vPYn6r6TOS1pJMV0SgciXoaQ/VW6
# WlRpZolvXYSye34JW8Rg4ojAz0oYfJ2IiUpwY7eSEq4DtuosTjEECKrgB8DLqKlM
# +qDd4/yeqVN5/sui5oGsR71aTMy/jdnzqmdmuFvsSwz9/7PfMEU=
# =FWp5
# -----END PGP SIGNATURE-----
# gpg: Signature made Tue Aug 15 11:18:09 2023 BST
# gpg:                using RSA key F124520CEEE062448FE1C8442D2EFA2F32FBE047
# gpg: Can't check signature: No public key

# Conflicts:
#	.github/workflows/docker.yml
#	CHANGES.md
#	Cargo.lock
#	debian/changelog
#	docs/upgrade.md
#	flake.lock
#	poetry.lock
#	pyproject.toml
#	rust/src/push/base_rules.rs
#	synapse/events/utils.py
#	synapse/handlers/pagination.py
#	synapse/http/site.py
#	synapse/rest/client/devices.py
#	synapse/storage/databases/main/roommember.py
#	synapse/storage/schema/__init__.py
#	tests/rest/client/test_devices.py
#	tests/rest/client/test_redactions.py
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implementation of MSC2918 refresh tokens makes transaction ID scoping in violation of spec
6 participants