We recently discovered a memory leak in the TransportService in a production use case of ES v2.4. Of course, v2.4 is old and EOL (and we're working on upgrading!), but we took a look at ES7 and believe it would have the same behavior as the relevant code paths are quite similar.
The problem lies in the ability to create unbounded growth in the requestHandlers map if certain requests never receive responses.
How request handlers are removed
From our reading of the code, a requestHandler is removed when one of the following happens (a simplified sketch of this lifecycle follows the list):
1. Sending a request fails to be submitted to the network:

   elasticsearch/server/src/main/java/org/elasticsearch/transport/TransportService.java, lines 731 to 735 in 694229f:

   ```java
   connection.sendRequest(requestId, action, request, options); // local node optimization happens upstream
   } catch (final Exception e) {
       // usually happen either because we failed to connect to the node
       // or because we failed serializing the message
       final Transport.ResponseContext<? extends TransportResponse> contextToNotify = responseHandlers.remove(requestId);
   ```

2. A request times out, but this only applies if a transport timeout was specified:

   elasticsearch/server/src/main/java/org/elasticsearch/transport/TransportService.java, lines 1059 to 1065 in 694229f:

   ```java
   @Override
   public void run() {
       if (responseHandlers.contains(requestId)) {
           long timeoutTime = threadPool.relativeTimeInMillis();
           timeoutInfoHandlers.put(requestId, new TimeoutInfoHolder(node, action, sentTime, timeoutTime));
           // now that we have the information visible via timeoutInfoHandlers, we try to remove the request id
           final Transport.ResponseContext<? extends TransportResponse> holder = responseHandlers.remove(requestId);
   ```

3. A response for the given request is received:

   elasticsearch/server/src/main/java/org/elasticsearch/transport/Transport.java, lines 224 to 232 in 694229f:

   ```java
   /**
    * called by the {@link Transport} implementation when a response or an exception has been received for a previously
    * sent request (before any processing or deserialization was done). Returns the appropriate response handler or null if not
    * found.
    */
   public TransportResponseHandler<? extends TransportResponse> onResponseReceived(final long requestId, final TransportMessageListener listener) {
       ResponseContext<? extends TransportResponse> context = handlers.remove(requestId);
       listener.onResponseReceived(requestId, context);
   ```

4. The transport connection is closed:

   elasticsearch/server/src/main/java/org/elasticsearch/transport/TransportService.java, lines 1019 to 1023 in 694229f:

   ```java
   @Override
   public void onConnectionClosed(Transport.Connection connection) {
       try {
           List<Transport.ResponseContext<? extends TransportResponse>> pruned =
               responseHandlers.prune(h -> h.connection().getCacheKey().equals(connection.getCacheKey()));
   ```
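To make the lifecycle concrete, here is a deliberately simplified, self-contained sketch of the bookkeeping described above. This is not the actual Elasticsearch code; the `PendingResponses` and `Entry` names are made up for illustration. The key property is that an entry added at send time is only ever removed by one of the four paths listed above:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Predicate;

// Toy stand-in for the requestHandlers/responseHandlers map: requestId -> pending context.
// If none of the four removal paths ever fires for a requestId, its entry (and everything
// it references, such as the original request and its response handler) stays here forever.
final class PendingResponses {
    record Entry(String action, String nodeId, long sentTimeMillis) {}

    private final Map<Long, Entry> handlers = new ConcurrentHashMap<>();

    // Added when a request is sent over the wire.
    void add(long requestId, Entry entry) {
        handlers.put(requestId, entry);
    }

    // Path (1): the send failed locally, the catch block removes the entry.
    // Path (2): a transport timeout fired, the scheduled timeout task removes it.
    // Path (3): a response or exception arrived, onResponseReceived removes it.
    Entry remove(long requestId) {
        return handlers.remove(requestId);
    }

    // Path (4): a connection closed, all entries belonging to it are pruned.
    List<Entry> prune(Predicate<Entry> shouldRemove) {
        List<Entry> pruned = new ArrayList<>();
        handlers.entrySet().removeIf(e -> {
            if (shouldRemove.test(e.getValue())) {
                pruned.add(e.getValue());
                return true;
            }
            return false;
        });
        return pruned;
    }

    int size() {
        return handlers.size();
    }
}
```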
What can go wrong
In our case, we saw the requestHandlers (called clientHandlers in v2.4) balloon to 50K+ entries, many of which were for requests that were days old. What happened was (see the sketch after this list):
- A scroll request came in (which does not get a transport timeout by default, invalidating removal option (2))
- The scroll request was successfully submitted to the network (invalidating removal option (1))
- A network partition meant a response from the destination node was never received, but the overall connection stayed open (invalidating removal options (3) and (4))
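In terms of the toy `PendingResponses` model sketched earlier (again, illustrative names rather than the real Elasticsearch classes), the scenario plays out like this:

```java
public class StuckScrollDemo {
    public static void main(String[] args) {
        PendingResponses pending = new PendingResponses();

        // A scroll request is sent with no transport timeout, so no timeout task is
        // ever scheduled for this requestId -> removal path (2) never runs.
        long requestId = 42L;
        pending.add(requestId, new PendingResponses.Entry(
            "indices:data/read/scroll", "node-2", System.currentTimeMillis()));

        // The send itself succeeded -> the catch-block removal, path (1), never runs.
        // The partition silently drops the response -> onResponseReceived, path (3), never runs.
        // The connection to node-2 stays open -> onConnectionClosed pruning, path (4), never runs.

        // Days later the entry is still there, pinning the request and its handler in memory,
        // along with every other request that got stuck the same way.
        System.out.println("pending handlers: " + pending.size()); // 1, forever
    }
}
```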
These are screencaps from a v2.4 heap dump, but since ES7 has the same four request handler removal options, I believe they apply similarly there:

By default, client requests do not receive transport timeouts

A request never received a response from a single destination node, causing the entire pending request+response to be permanently stuck in memory

Over time this caused a huge buildup of stale handler entries, taking over 12GB of heap. (yes, we know those 1GB response entries aren't great, yes we're working on fixing them :) )
Possible solutions
Assuming our analysis of ES7 is sound and this could still be a problem in the latest builds, some possible solutions:
- Always set a transport timeout on requests, to ensure requests are eventually cleaned up. For known long-running requests this might need to be adapted to verify the requests are still making forward progress and re-schedule the timeouts.
- Use a reaper thread to comb through the requestHandlers map with some frequency to see if any requests are stuck (a rough sketch follows this list)
- Retry a request if no response is received after a certain period of time
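For the reaper idea, here is a rough sketch of what such a thread could look like, again written against the toy model above rather than the real responseHandlers. The age threshold, interval, and class name are placeholders; a real version would also need to complete the reaped handlers with a failure and ideally check for forward progress rather than plain age:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Periodically sweeps the pending-response map and drops entries older than maxAgeMillis.
final class PendingResponseReaper {
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    void start(PendingResponses pending, long maxAgeMillis, long intervalMillis) {
        scheduler.scheduleAtFixedRate(() -> {
            long now = System.currentTimeMillis();
            var reaped = pending.prune(entry -> now - entry.sentTimeMillis() > maxAgeMillis);
            if (!reaped.isEmpty()) {
                // In a real implementation each reaped handler would be completed
                // exceptionally (e.g. with a timeout exception) so callers can react.
                System.out.println("reaped " + reaped.size() + " stuck request handlers");
            }
        }, intervalMillis, intervalMillis, TimeUnit.MILLISECONDS);
    }

    void stop() {
        scheduler.shutdownNow();
    }
}
```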
This issue for v6.2.4 also seems quite relevant and is very similar to the heap dumps we pulled from v2.4: https://discuss.elastic.co/t/heap-memory-leak-in-elasticsearch-6-2-4/214419