retry federation requests on 5xx errors #8915

richvdh · 2020-12-10T13:16:05Z

federationhttpclient has a bunch of logic to retry requests, but it only fires for connection failures, 429 responses, and (optionally) dns lookup errors. Why don't we retry on 5xx errors?

(For example: we sometimes see 520 errors from matrix.org. Arguably that's indicative of a problem, but I think we could be robust to it.)

The text was updated successfully, but these errors were encountered:

erikjohnston · 2020-12-15T11:52:07Z

I think this is more of a thinko that we should just fix. I think we should just extend:

synapse/synapse/http/matrixfederationclient.py

Lines 523 to 528 in 30fba62

    
           # Retry if the error is a 429 (Too Many Requests), 
        
           # otherwise just raise a standard HttpResponseException 
        
           if response.code == 429: 
        
               raise RequestSendFailed(exc, can_retry=True) from exc 
        
           else: 
        
               raise exc

Though I wonder if we should not retry quite so often on 500 errors, as they should(?) be more likely due to bad request, but that isn't always true.

richvdh · 2020-12-15T12:00:08Z

Though I wonder if we should not retry quite so often on 500 errors, as they should(?) be more likely due to bad request

a bad request should cause a 400 error :-p

richvdh · 2020-12-15T12:02:27Z

In fact I would say the very thing that distinguishes a 500 from a 400 is that a 500 says "please try the same thing again, you might have more luck" (whereas a 400 says "stop doing that please")

erikjohnston · 2020-12-15T12:16:41Z

Yeah, that's fair

y-pankaj · 2021-01-22T15:36:17Z

Hi! I would like to take up this issue.
As mentioned before by @erikjohnston, should I just extend the if statement and change the comment as well?

turt2live · 2021-01-26T22:48:12Z

503 backing off for 10 minutes is also a bit harsh:

[federation_sender_1] 2021-01-26 22:33:41,773 - synapse.http.matrixfederationclient - 217 - INFO - federation_transaction_transmission_loop-93440- {PUT-O-78425} [fosdem.org] Completed request: 200 OK in 0.07 secs - PUT matrix://fosdem.org/_matrix/federation/v1/send/1611673272139
[federation_sender_1] 2021-01-26 22:33:41,773 - synapse.federation.sender.transaction_manager - 163 - INFO - federation_transaction_transmission_loop-93440- TX [fosdem.org] {1611673272139} got 200 response
[federation_sender_1] 2021-01-26 22:33:45,254 - synapse.federation.sender.transaction_manager - 127 - INFO - federation_transaction_transmission_loop-93445- TX [fosdem.org] {1611673272144} Sending transaction [1611673272144], (PDUs: 0, EDUs: 1)
[federation_sender_1] 2021-01-26 22:33:45,281 - synapse.http.matrixfederationclient - 217 - INFO - federation_transaction_transmission_loop-93445- {PUT-O-78430} [fosdem.org] Completed request: 200 OK in 0.03 secs - PUT matrix://fosdem.org/_matrix/federation/v1/send/1611673272144
[federation_sender_1] 2021-01-26 22:33:45,282 - synapse.federation.sender.transaction_manager - 163 - INFO - federation_transaction_transmission_loop-93445- TX [fosdem.org] {1611673272144} got 200 response
[federation_sender_1] 2021-01-26 22:33:55,234 - synapse.federation.sender.transaction_manager - 127 - INFO - federation_transaction_transmission_loop-93449- TX [fosdem.org] {1611673272148} Sending transaction [1611673272148], (PDUs: 0, EDUs: 1)
[federation_sender_1] 2021-01-26 22:33:55,248 - synapse.http.matrixfederationclient - 508 - INFO - federation_transaction_transmission_loop-93449- {PUT-O-78434} [fosdem.org] Got response headers: 503 Service Temporarily Unavailable
[federation_sender_1] 2021-01-26 22:33:55,249 - synapse.http.matrixfederationclient - 586 - WARNING - federation_transaction_transmission_loop-93449- {PUT-O-78434} [fosdem.org] Request failed: PUT matrix://fosdem.org/_matrix/federation/v1/send/1611673272148: HttpResponseException('503: Service Temporarily Unavailable',)
[federation_sender_1] 2021-01-26 22:33:55,249 - synapse.util.retryutils - 215 - INFO - federation_transaction_transmission_loop-93449- Connection to fosdem.org was unsuccessful (<class 'synapse.api.errors.HttpResponseException'>(503: Service Temporarily Unavailable)); backoff now 600000
[federation_sender_1] 2021-01-26 22:33:55,249 - synapse.federation.sender.transaction_manager - 159 - INFO - federation_transaction_transmission_loop-93449- TX [fosdem.org] {1611673272148} got 503 response
[federation_sender_1] 2021-01-26 22:33:55,250 - synapse.federation.sender.per_destination_queue - 411 - WARNING - federation_transaction_transmission_loop-93449- TX [fosdem.org] Received 503 response to transaction: 503: Service Temporarily Unavailable

My server was happily talking to fosdem.org for a long while, but then fosdem.org was restarted which lead to a single 503 - the backoff is a bit harsh here.

ShadowJonathan · 2021-02-08T18:41:22Z

Why is the backoff a flat 10 minutes, anyways? Why not exponential backoff - from 10 seconds - that caps at around 15 minutes?

richvdh · 2021-02-09T12:08:23Z

it is exponential. It starts at 10 minutes because it assumes (incorrectly, per this issue) that by that point you've already tried a few times. See also #5406 (comment) for somewhere I've written about this in the past.

Fixes #8915

richvdh mentioned this issue Dec 10, 2020

we should be more intelligent with backoff for federation requests #8917

Open

erikjohnston added z-bug (Deprecated Label) z-p2 (Deprecated Label) labels Dec 15, 2020

callahad added the good first issue label Jan 6, 2021

richvdh mentioned this issue Jan 26, 2021

It's a bit harsh that a single 503 error means the server doesn't receive traffic for 10 minutes #9233

Closed

richvdh mentioned this issue Feb 8, 2021

Synapse should not back off federation for 10 minutes due to a single received 5xx error #9335

Closed

richvdh mentioned this issue Feb 10, 2021

we should backoff on 40x errors #5442

Open

richvdh added the good first issue Good for newcomers label Mar 2, 2021

erikjohnston mentioned this issue Mar 9, 2021

Retry 5xx errors in federation client #9567

Merged

erikjohnston added a commit that referenced this issue Mar 9, 2021

Retry 5xx errors in federation client

bd6ee75

Fixes #8915

erikjohnston closed this as completed in #9567 Mar 9, 2021

erikjohnston added a commit that referenced this issue Mar 9, 2021

Retry 5xx errors in federation client (#9567)

9cd18cc

Fixes #8915

matrixbot mentioned this issue Dec 21, 2023

we should be more intelligent with backoff for federation requests element-hq/synapse#8917

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

retry federation requests on 5xx errors #8915

retry federation requests on 5xx errors #8915

richvdh commented Dec 10, 2020

erikjohnston commented Dec 15, 2020

richvdh commented Dec 15, 2020

richvdh commented Dec 15, 2020

erikjohnston commented Dec 15, 2020

y-pankaj commented Jan 22, 2021

turt2live commented Jan 26, 2021

ShadowJonathan commented Feb 8, 2021

richvdh commented Feb 9, 2021

retry federation requests on 5xx errors #8915

retry federation requests on 5xx errors #8915

Comments

richvdh commented Dec 10, 2020

erikjohnston commented Dec 15, 2020

richvdh commented Dec 15, 2020

richvdh commented Dec 15, 2020

erikjohnston commented Dec 15, 2020

y-pankaj commented Jan 22, 2021

turt2live commented Jan 26, 2021

ShadowJonathan commented Feb 8, 2021

richvdh commented Feb 9, 2021