
Synapse takes forever to send large responses - dead time gap after encode_json_response #17722

Open
MadLittleMods opened this issue Sep 17, 2024 · 3 comments

@MadLittleMods (Collaborator) commented on Sep 17, 2024

Synapse takes forever to send large responses. In some cases it takes longer to send the response than it does to process the request and run encode_json_response.

Examples

98s to process the request and run encode_json_response, but the response isn't finished sending until 484s (~8 minutes), which leaves ~6.5 minutes of dead time. The response size is 36MB

Jaeger trace: 4238bdbadd9f3077.json

Jaeger trace with big gap for sending the response
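(For scale: if the whole 36MB body is streamed during that gap, the effective throughput works out to roughly 36 MB / 386 s ≈ 93 KB/s, far below what we'd expect from the network itself.)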

59s to process, but the response doesn't finish sending until 199s. The response size is 36MB

Jaeger trace: 2149cc5e59306446.json

Jaeger trace with big gap for sending the response

I've come across this before and it's not a new thing. For example, in #13620 I described it as the "mystery gap at the end after we encode the JSON response (encode_json_response)", but I've never seen it be this egregious.

It can also happen for small responses: 2s to process, finished after 5s. The response size is 120KB

Jaeger trace with big gap for sending the response

Investigation

@kegsay pointed out _write_bytes_to_request, which runs after encode_json_response and contains comments like "Write until there's backpressure telling us to stop." that definitely hint at some areas of interest.

with start_active_span("encode_json_response"):
    span = active_span()
    json_str = await defer_to_thread(request.reactor, encode, span)

_write_bytes_to_request(request, json_str)

The JSON serialization is done in a background thread because it can block the reactor for many seconds. This part seems normal and fast (no problem).
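For context, the general Twisted pattern here looks roughly like this (a minimal sketch using twisted.internet.threads.deferToThread directly; Synapse's defer_to_thread wrapper layers logcontext handling on top):

# Sketch only: offload the blocking json.dumps() to the reactor's thread
# pool so the reactor thread stays free to service other requests.
import json

from twisted.internet import threads


def encode_json_off_reactor(json_object):
    """Return a Deferred that fires with the encoded response bytes.

    json.dumps() can block for many seconds on a multi-megabyte response,
    which would stall every other request if it ran on the reactor thread.
    """
    return threads.deferToThread(lambda: json.dumps(json_object).encode("utf-8"))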

But we also use _ByteProducer to send the bytes down to the client. Using a producer ensures we can send all of the bytes to the client without hitting a 60s timeout (see the context in the comments below).

# The problem with dumping all of the response into the `Request` object at
# once (via `Request.write`) is that doing so starts the timeout for the
# next request to be received: so if it takes longer than 60s to stream back
# the response to the client, the client never gets it.
#
# The correct solution is to use a Producer; then the timeout is only
# started once all of the content is sent over the TCP connection.

This logic was added in:

Some extra time is expected since we're cooperating with the reactor instead of blocking it, but it seems like something isn't tuned optimally (chunk size, the producer starting/stopping too often, etc.); see the sketch below.
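To make the suspected moving parts concrete, here is a minimal sketch of the push-producer pattern that _ByteProducer is built on. The class name, CHUNK_SIZE, and structure below are illustrative assumptions rather than Synapse's actual implementation, but they show where the tuning knobs live: every chunk is written from the reactor thread, and a paused producer makes no progress until the transport drains and calls resumeProducing.

# A minimal, hypothetical sketch of the push-producer pattern (not Synapse's
# actual _ByteProducer). The transport applies backpressure by calling
# pauseProducing()/resumeProducing(), and every chunk is written from the
# reactor thread, so chunk size and pause/resume churn both affect how long
# the "send the response" phase takes.
from twisted.internet.interfaces import IPushProducer
from zope.interface import implementer


@implementer(IPushProducer)
class SketchByteProducer:
    CHUNK_SIZE = 64 * 1024  # tuning knob: bytes written per reactor tick

    def __init__(self, reactor, request, data: bytes) -> None:
        self._reactor = reactor
        self._request = request
        self._data = data
        self._offset = 0
        self._paused = False
        request.registerProducer(self, streaming=True)
        self._send_next_chunk()

    def _send_next_chunk(self) -> None:
        if self._paused:
            return  # the transport asked us to stop; resumeProducing() restarts us
        chunk = self._data[self._offset : self._offset + self.CHUNK_SIZE]
        if not chunk:
            self._request.unregisterProducer()
            self._request.finish()
            return
        self._offset += len(chunk)
        self._request.write(chunk)
        # Yield back to the reactor between chunks instead of looping here, so
        # other requests can make progress while we stream a 36MB body.
        self._reactor.callLater(0, self._send_next_chunk)

    # IPushProducer interface
    def pauseProducing(self) -> None:
        self._paused = True

    def resumeProducing(self) -> None:
        self._paused = False
        self._send_next_chunk()

    def stopProducing(self) -> None:
        # Connection went away; drop the remaining data.
        self._offset = len(self._data)

Roughly speaking, if the dead time is spent waiting between pauseProducing/resumeProducing calls, the bottleneck is the client/TCP path; if it's spent on lots of tiny writes and reactor hops on a busy worker, it looks more like a chunk-size/CPU-contention problem.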

@reivilibre (Contributor) commented:
It'd probably be good to check if the worker was overloaded at the time the trace was taken (i.e. it's just CPU contention), or if this is something different.

@MadLittleMods (Collaborator, Author) commented:
I've updated the description with the Jaeger traces if you're curious about the timings. Seems like the CPU was busy but not totally overloaded.

https://grafana.matrix.org/d/000000012/synapse?orgId=1&var-datasource=default&var-bucket_size=$__auto_interval_bucket_size&var-instance=matrix.org&var-job=synapse_sliding_sync&var-index=All&from=1726513200000&to=1726516800000

Hour where the requests happened:

Grafana CPU graph from the Sliding Sync worker during the time of the example requests. Busy but not peaked.

The whole day yesterday:

Grafana CPU graph from the Sliding Sync worker for the whole day yesterday. Busy but not peaked.

@kegsay (Contributor) commented on Sep 18, 2024

Additional data point: [screenshot, 2024-09-18] for an initial /sync in SSS.
