
Synapse takes forever to send large responses - dead time gap after encode_json_response #17722

Open
MadLittleMods opened this issue Sep 17, 2024 · 3 comments

@MadLittleMods (Collaborator) commented on Sep 17, 2024

Synapse takes forever to send large responses. In some cases it takes longer to send the response than it does to process the request and run encode_json_response.

Examples

98s to process the request and run encode_json_response, but the response isn't finished sending until 484s (~8 minutes), which leaves ~6.5 minutes of dead time. The response size is 36MB

Jaeger trace: 4238bdbadd9f3077.json

Jaeger trace with big gap for sending the response
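(For scale: if the whole 36MB body is streamed during that gap, the effective throughput works out to roughly 36 MB / 386 s ≈ 93 KB/s, far below what we'd expect from the network itself.)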

59s to process, but the response doesn't finish sending until 199s. The response size is 36MB

Jaeger trace: 2149cc5e59306446.json

Jaeger trace with big gap for sending the response

I've come across this before and it's not a new thing. For example, in #13620 I described it as the "mystery gap at the end after we encode the JSON response (encode_json_response)", but I've never seen it be this egregious.

It can also happen for small responses: 2s to process, finished after 5s. The response size is 120KB

Jaeger trace with big gap for sending the response

Investigation

@kegsay pointed out _write_bytes_to_request, which runs after encode_json_response and contains comments like "Write until there's backpressure telling us to stop." that definitely hint at some areas of interest.

with start_active_span("encode_json_response"):
    span = active_span()
    json_str = await defer_to_thread(request.reactor, encode, span)

_write_bytes_to_request(request, json_str)

The JSON serialization is done in a background thread because it can block the reactor for many seconds. This part seems normal and fast (no problem).
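For context, the general Twisted pattern here looks roughly like this (a minimal sketch using twisted.internet.threads.deferToThread directly; Synapse's defer_to_thread wrapper layers logcontext handling on top):

# Sketch only: offload the blocking json.dumps() to the reactor's thread
# pool so the reactor thread stays free to service other requests.
import json

from twisted.internet import threads


def encode_json_off_reactor(json_object):
    """Return a Deferred that fires with the encoded response bytes.

    json.dumps() can block for many seconds on a multi-megabyte response,
    which would stall every other request if it ran on the reactor thread.
    """
    return threads.deferToThread(lambda: json.dumps(json_object).encode("utf-8"))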

But we also use _ByteProducer to send the bytes down to the client. Using a producer ensures we can send all of the bytes to the client without hitting a 60s timeout (see the context in the comments below).

# The problem with dumping all of the response into the `Request` object at
# once (via `Request.write`) is that doing so starts the timeout for the
# next request to be received: so if it takes longer than 60s to stream back
# the response to the client, the client never gets it.
#
# The correct solution is to use a Producer; then the timeout is only
# started once all of the content is sent over the TCP connection.

This logic was added in:

Some extra time is expected since we're cooperating with the reactor instead of blocking it, but it seems like something isn't tuned optimally (chunk size, the producer starting/stopping too often, etc.); see the sketch below.
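To make the suspected moving parts concrete, here is a minimal sketch of the push-producer pattern that _ByteProducer is built on. The class name, CHUNK_SIZE, and structure below are illustrative assumptions rather than Synapse's actual implementation, but they show where the tuning knobs live: every chunk is written from the reactor thread, and a paused producer makes no progress until the transport drains and calls resumeProducing.

# A minimal, hypothetical sketch of the push-producer pattern (not Synapse's
# actual _ByteProducer). The transport applies backpressure by calling
# pauseProducing()/resumeProducing(), and every chunk is written from the
# reactor thread, so chunk size and pause/resume churn both affect how long
# the "send the response" phase takes.
from twisted.internet.interfaces import IPushProducer
from zope.interface import implementer


@implementer(IPushProducer)
class SketchByteProducer:
    CHUNK_SIZE = 64 * 1024  # tuning knob: bytes written per reactor tick

    def __init__(self, reactor, request, data: bytes) -> None:
        self._reactor = reactor
        self._request = request
        self._data = data
        self._offset = 0
        self._paused = False
        request.registerProducer(self, streaming=True)
        self._send_next_chunk()

    def _send_next_chunk(self) -> None:
        if self._paused:
            return  # the transport asked us to stop; resumeProducing() restarts us
        chunk = self._data[self._offset : self._offset + self.CHUNK_SIZE]
        if not chunk:
            self._request.unregisterProducer()
            self._request.finish()
            return
        self._offset += len(chunk)
        self._request.write(chunk)
        # Yield back to the reactor between chunks instead of looping here, so
        # other requests can make progress while we stream a 36MB body.
        self._reactor.callLater(0, self._send_next_chunk)

    # IPushProducer interface
    def pauseProducing(self) -> None:
        self._paused = True

    def resumeProducing(self) -> None:
        self._paused = False
        self._send_next_chunk()

    def stopProducing(self) -> None:
        # Connection went away; drop the remaining data.
        self._offset = len(self._data)

Roughly speaking, if the dead time is spent waiting between pauseProducing/resumeProducing calls, the bottleneck is the client/TCP path; if it's spent on lots of tiny writes and reactor hops on a busy worker, it looks more like a chunk-size/CPU-contention problem.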

@reivilibre (Contributor) commented:
It'd probably be good to check if the worker was overloaded at the time the trace was taken (i.e. it's just CPU contention), or if this is something different.

@MadLittleMods (Collaborator, Author) commented:
I've updated the description with the Jaeger traces if you're curious about the timings. Seems like the CPU was busy but not totally overloaded.

https://grafana.matrix.org/d/000000012/synapse?orgId=1&var-datasource=default&var-bucket_size=$__auto_interval_bucket_size&var-instance=matrix.org&var-job=synapse_sliding_sync&var-index=All&from=1726513200000&to=1726516800000

Hour where the requests happened:

Grafana CPU graph from the Sliding Sync worker during the time of the example requests. Busy but not peaked.

The whole day yesterday:

Grafana CPU graph from the Sliding Sync worker for the whole day yesterday. Busy but not peaked.

@kegsay (Contributor) commented on Sep 18, 2024

Additional data point: [screenshot, 2024-09-18] for an initial /sync in SSS.
