Faster joins: handle total failure to sync state #13000

richvdh · 2022-06-09T08:20:12Z

Currently, if we try every server in the room and are unable to sync state from any of them, we give up, leaving us with a room stuck in "partial state" state, and any C-S requests for state in that room timing out indefinitely.

It's not entirely clear what we should do in this case:

- Giving up isn't the right thing to do if there's a temporary network outage.
- Retrying indefinitely is also not the right thing to do if we can reach all homeservers and they all claim they don't have the state we want.

synapse/synapse/handlers/federation.py

Lines 1594 to 1610 in 7c6b220

    
    if attempt == len(destinations) - 1:
        # We have tried every remote server for this event. Give up.
        # TODO(faster_joins) giving up isn't the right thing to do
        #   if there's a temporary network outage. retrying
        #   indefinitely is also not the right thing to do if we can
        #   reach all homeservers and they all claim they don't have
        #   the state we want.
        #   https://github.com/matrix-org/synapse/issues/13000
        logger.error(
            "Failed to get state for %s at %s from %s because %s, "
            "giving up!",
            room_id,
            event,
            destination,
            e,
        )
        raise
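
One way to read the two cases above is as a policy that depends on why each server failed. The following is only a minimal sketch of that idea, not Synapse code: `Outcome`, `Decision`, `decide`, `backoff_delay`, `sync_state` and the `fetch` callback are all hypothetical names, the backoff constants are arbitrary, and the loop is synchronous, whereas the real handler is asynchronous. It assumes the caller can tell "could not reach the server" apart from "the server answered but does not have the state".

```python
import enum
import time
from typing import Callable, Iterable, Sequence


class Outcome(enum.Enum):
    OK = "ok"
    UNREACHABLE = "unreachable"  # e.g. timeout or connection refused (possibly transient)
    MISSING = "missing"  # server responded but claims not to have the state


class Decision(enum.Enum):
    DONE = "done"
    RETRY_LATER = "retry_later"
    GIVE_UP = "give_up"


def decide(outcomes: Iterable[Outcome]) -> Decision:
    """Pick a policy from one full round of attempts (hypothetical helper)."""
    seen = set(outcomes)
    if Outcome.OK in seen:
        return Decision.DONE
    if Outcome.UNREACHABLE in seen:
        # At least one server might still have the state: worth retrying.
        return Decision.RETRY_LATER
    # Every server answered and none has the state (or there were no servers).
    return Decision.GIVE_UP


def backoff_delay(round_number: int, base: float = 5.0, cap: float = 3600.0) -> float:
    """Exponential backoff in seconds, capped so retries never stop entirely."""
    return min(cap, base * (2 ** round_number))


def sync_state(
    destinations: Sequence[str],
    fetch: Callable[[str], Outcome],
    max_rounds: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> Decision:
    """Try every destination per round; retry rounds only for transient failures."""
    for round_number in range(max_rounds):
        outcomes = []
        for destination in destinations:
            outcome = fetch(destination)
            if outcome is Outcome.OK:
                return Decision.DONE
            outcomes.append(outcome)
        if decide(outcomes) is Decision.GIVE_UP:
            return Decision.GIVE_UP
        if round_number < max_rounds - 1:
            sleep(backoff_delay(round_number))
    # Retry budget exhausted while servers were still unreachable.
    return Decision.GIVE_UP
```

In this sketch a round in which every server answers `MISSING` gives up immediately, while a round containing any `UNREACHABLE` result is retried after a growing delay. What "give up" should then mean for the room is the open question in this issue.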

reivilibre · 2022-07-28T15:05:06Z

As you say, it's not clear what we want in this case.

Should we eventually boot the user(s) out of the room and shut it down, pretending it never happened? That sounds pretty janky, but perhaps defensible if the UI makes it clear that you're not 'properly joined' whilst the partial join is going on. Is that something we'd want to do?

richvdh · 2022-07-28T16:01:31Z

I kinda think that's what we'll have to do, ultimately, though we'd probably have to figure out a way to get the memo to the clients about the reason we're giving up on the room. To be honest that sounds like a general problem - "we've given up on this room" can happen for other reasons (notably: it getting shut down by an admin) - so this might need spec changes.

erikjohnston · 2022-07-29T13:19:25Z

Can we do an out-of-band leave like we do for rejecting invites? I think that would end up doing roughly the right thing? I'm kinda assuming this situation would be rare enough that we don't need to worry too much about making the UX slick, so long as we end up in a sane state.
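
As a rough illustration of the out-of-band leave idea, here is a toy model. It does not use any Synapse API. `ToyStore`, `MembershipEvent` and `abandon_partial_join` are hypothetical names, and "out of band" is modelled simply as a flag on a leave event that is recorded locally and never sent to remote servers.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class MembershipEvent:
    room_id: str
    user_id: str
    membership: str
    # True means the event was generated locally and is not federated
    # (an assumption of this toy model).
    out_of_band: bool = False


@dataclass
class ToyStore:
    # room_id -> local users currently joined (hypothetical in-memory state)
    joined: Dict[str, Set[str]] = field(default_factory=dict)
    # rooms whose state has not been fully synced yet
    partial_state_rooms: Set[str] = field(default_factory=set)
    events: List[MembershipEvent] = field(default_factory=list)


def abandon_partial_join(store: ToyStore, room_id: str) -> List[MembershipEvent]:
    """Record a local-only leave for each local user and clear the partial-state flag."""
    leaves = []
    for user_id in sorted(store.joined.pop(room_id, set())):
        leave = MembershipEvent(room_id, user_id, "leave", out_of_band=True)
        store.events.append(leave)
        leaves.append(leave)
    # The room is no longer waiting on a state sync that will never finish.
    store.partial_state_rooms.discard(room_id)
    return leaves
```

The point of the sketch is only the end state: no local user remains joined and the room is no longer marked as partial-state, so nothing is left waiting on it.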

squahtx · 2022-09-30T13:38:46Z

As part of this issue, we'll want to enable / fix a skipped device list tracking test from "Faster room joins: Add abandoned join tests for device list tracking" (complement#476).

H-Shay · 2023-09-08T17:30:33Z

So I took a stab at this and have a branch where I did an out-of-band leave when syncing hit the total failure state (and a test for this). However, I then realized that the code that I called to process the leave was only defined on the master, and so this solution would not work for worker instances. This is as far as I got with it. I've pushed the branch here if that's helpful for anyone.

richvdh added this to the Faster joins (further work) milestone Jun 9, 2022

richvdh added the A-Federated-Join (joins over federation generally suck) and T-Defect (Bugs, crashes, hangs, security vulnerabilities, or other reported issues) labels Jun 9, 2022

kittykat mentioned this issue Oct 3, 2022

Improve time to join a remote room #14030

Closed

16 tasks

richvdh removed this from the Q3 2022: Faster joins: fix major known bugs for monoliths milestone Oct 5, 2022

erikjohnston added this to the Remaining faster remote join work milestone Mar 9, 2023

H-Shay self-assigned this Aug 14, 2023

H-Shay removed their assignment Sep 8, 2023

matrixbot mentioned this issue Dec 21, 2023

Faster joins: handle total failure to sync state element-hq/synapse#13000

Open
