
account_data returned by /_matrix/client/unstable/org.matrix.msc3575/sync are sometimes inconsistent #189

Closed
nimau opened this issue Jun 28, 2023 · 12 comments · Fixed by #237
nimau commented Jun 28, 2023

Description

I see strange behaviour when updating a push rule via the push rules API (/_matrix/client/v3/pushrules).
Each request (PUT or DELETE) triggers a sync response, but sometimes the response is incorrect.
Here, after creating a ROOM rule and deleting an OVERRIDE rule, I get 6 sync responses with different values in account_data.

Sometimes `m.push_rules` appears more than once (#70, #73), and sometimes an old value comes back temporarily (#71 is up to date, #75 still contains the deleted rule).

Here are the requests visible in the attached Proxyman log:

#67: PUT ROOM rule (not notifying)
#68: DELETE OVERRIDE rule
#66: sync response: 2 rules
	ROOM + OVERRIDE
#70: sync response: `account_data.global` contains the `m.push_rules` twice
	ROOM + OVERRIDE
	ROOM + OVERRIDE
#71: sync response: 1 rule
	ROOM
#73: sync response: `account_data.global` contains the `m.push_rules` twice
	ROOM
	ROOM
#75: sync response: 2 rules (old data?)
	ROOM + OVERRIDE
#77: sync response: 1 rule
	ROOM

ElementX_06-28-2023-16-02-36_ss_push_rules.proxymanlogv2.zip

(same session exported in postman format)
ElementX_06-28-2023-16-33-17_postman.json.zip

Steps to reproduce

  • (not filled in; see the reproduction steps in the comment of Jul 31, 2023 below)

Homeserver

matrix.org

Synapse Version

1.86.0

Installation Method

I don't know

Database

I don't know

Workers

I don't know

Platform

I don't know

Configuration

No response

Relevant log output

see attached files.

Anything else that would be useful to know?

No response

@MatMaul changed the title account_data returned by /_matrix/client/unstable/org.matrix.msc3575/sync are sometimes inconsistent Jun 28, 2023
@MadLittleMods commented:

Given that the sliding sync proxy is an extra variable in the middle of all of this, it seems like the first thing to eliminate.

I'm unable to transfer this to https://github.com/matrix-org/sliding-sync but perhaps @DMRobertson has access?

@DMRobertson transferred this issue from matrix-org/synapse Jun 28, 2023
DMRobertson (Contributor) commented Jun 28, 2023

First step would be for us to reproduce this with some kind of test on the proxy side. Lines of investigation:

  • is this some kind of hs-to-proxy race?
  • what requests is the client making? are there lots of push rules updates in short succession? (presumably there is no kind of txn id for push rule updates...)
  • are there multiple pollers at play here that are fighting each other to emit push rules updates?

> sync response: `account_data.global` contains the `m.push_rules` twice

this sounds wrong. Is the proxy not correctly deduping or merging account data updates here?

kegsay (Member) commented Jun 28, 2023

Account data updates have no ID so I suppose it's possible for them to step on each other if you have >1 device? In the same way that typing notifs can?

@kegsay added the roadmap label Jul 28, 2023
nimau (Author) commented Jul 31, 2023

@kegsay I have run a few more tests and can confirm that the problem does not occur if there is only one session on EIX.
The problem appears as soon as there are at least 2 sessions on EIX.

Steps to reproduce:

  • Open an EIX session on device A.
  • When updating a push rule, only one /sync response is triggered.
  • Open a second EIX session on device B.
  • When updating a push rule on device A, 2 /sync responses are triggered, and sometimes `push_rules` is present twice in the same response.

kegsay (Member) commented Aug 2, 2023

#237 would mostly address this, as it will cause the following:

#67: PUT ROOM rule (not notifying)
#68: DELETE OVERRIDE rule
#66: sync response: 1 rule
	ROOM + OVERRIDE
#71: sync response: 1 rule
	ROOM
#75: sync response: 1 rule
	ROOM + OVERRIDE
#77: sync response: 1 rule
	ROOM

In other words it will A) stop multiple push rules coming through in the response and B) ensure you don't get told the same data twice. However, it does nothing to help guard against flickering (where it temporarily resets back to the old value then advances forward) as the proxy has no way to uniquely identify versions of the event being updated.

kegsay (Member) commented Aug 2, 2023

Precise timings in the logs:

  • Request End Time: 14:54:52.686 (req 67 to update room push rules)
  • Request End Time: 14:54:52.758 (req 68 to delete an override)
  • Response End Time: 14:54:52.775 (req 66 blocking on a sync response, 2x push rules, updated room push rule)
  • Response End Time: 14:54:52.815 (req 70 sync, 1x push rule, updated room push rule value)
  • Response End Time: 14:54:52.856 (req 71 sync, 1x push rule, fully up-to-date)
  • Response End Time: 14:54:52.959 (req 75 sync, containing reverted value, updated room push rule value only)

So there was a latency gap of ~200ms on one poller, which caused it to revert briefly.

@kegsay reopened this Aug 2, 2023
kegsay (Member) commented Aug 2, 2023

This isn't resolved yet because flickering can still happen. This issue now tracks fixing that, which dmr hints at:

> maybe it makes more sense to do a single dedicated account data poller? There was that race I outlined on switchover which would be bad... but maybe you can patch around it by doing an explicit account data GET

race being:

  • poller A is the acct data poller
  • poller B sees some new piece of acct data
  • poller A stops, before it sees the new acct data
  • poller B becomes the nominated poller

the proxy won't see that new piece of acct data ever

The problem is there is no bulk GET endpoint for account data, so if "new piece of acct data" is a brand new unseen type, we won't know the endpoint to hit. To fix this, we can track which room|type hashes each poller has seen, so when we do the switcheroo we can just diff the hashes and if they aren't the same -> go GET the event bytes.

Data structure wise, if each poller has a room|type map, and the handler2 contains a map of PollerID -> *Poller, we can then track this information reasonably well, as expired pollers will be cleared up eventually.

pmaier1 commented Aug 15, 2023

@manuroe @nimau we'd like to know whether this is good enough after applying #237 for the September time frame. If yes, we'd like to deprioritise further improvements.

@kegsay added the poller label Aug 16, 2023
kegsay (Member) commented Aug 23, 2023

@nimau says:

> that's definitely better. Can we wait for Manu to decide whether the flicker is a blocker or not? (I would say no, because it works and changing notification settings is probably not something we do every day, but for the user it may be perceived as strange behaviour.)

kegsay (Member) commented Sep 15, 2023

Empirically this didn't seem to be a blocker.

hughns (Member) commented Sep 26, 2023

@manuroe please can you advise on the flicker question as per above?

manuroe commented Sep 26, 2023

I am aligned with @kegsay on "changing notification settings is probably not something we do every day". The current solution is good enough. Let's fix it properly in the next gen ;)
I checked with @nimau. He has not seen the issue for a long time, and the feature was later implemented on Android without the issue being noticed. I am closing this issue.
