Defer circuit breaker release until transport write completes#143136
Defer circuit breaker release until transport write completes#143136drempapis merged 70 commits intoelastic:mainfrom
Conversation
|
Pinging @elastic/es-search-foundations (Team:Search Foundations) |
|
@DaveCTurner Could you share your thoughts on the approach we took here? Before his vacation, Andrei mentioned that you had discussed another possible solution, but I didn’t get many details. If this approach isn’t the right one, could you guide me on how you think we should move forward? |
DaveCTurner
left a comment
There was a problem hiding this comment.
Ah no we don't want to do this. Outbound transport messages already have this lifecycle, let's not add another one.
The exact behaviour depends on whether you're using a BytesTransportResponse or not. With a BytesTransportResponse the content is retained via ref-counting, with a decRef() once the message is sent. With other responses the message content is copied into the final network message before sendResponse returns, so there's no need to retain anything after that point.
|
Sorry I forgot to highlight the impact on CBs before hitting "send". If the messages are large enough to warrant CB integration then they should be using a |
server/src/test/java/org/elasticsearch/action/search/AsBytesResponseTests.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/action/search/SearchTransportService.java
Outdated
Show resolved
Hide resolved
server/src/test/java/org/elasticsearch/action/search/AsBytesResponseTests.java
Outdated
Show resolved
Hide resolved
...rc/test/java/org/elasticsearch/action/search/SearchTransportServiceChannelListenerTests.java
Show resolved
Hide resolved
...rc/test/java/org/elasticsearch/action/search/SearchTransportServiceChannelListenerTests.java
Show resolved
Hide resolved
server/src/test/java/org/elasticsearch/action/search/AsBytesResponseTests.java
Outdated
Show resolved
Hide resolved
...rc/test/java/org/elasticsearch/action/search/SearchTransportServiceChannelListenerTests.java
Show resolved
Hide resolved
spinscale
left a comment
There was a problem hiding this comment.
Only left minor nits, that I'll leave up to you.
andreidan
left a comment
There was a problem hiding this comment.
LGTM, thanks Dimitris 🚀
…c#143136) - The fetch phase builds the search hits and reserves circuit-breaker bytes for them. - On the network path, the response is serialized into a circuit-breaker-aware byte stream. At this point, both the response objects and the serialized bytes are on the heap and tracked by the breaker. - As soon as serialization completes, the search layer releases the response object bytes in a finally block. The heavy Java objects are no longer needed and can be garbage-collected. - The serialized bytes remain tracked on the breaker while they sit in Netty's write queue. When Netty finishes writing and drops its reference, the pages are freed and the breaker is decremented. - On the direct (same-node) path, no serialization happen, the response is forwarded as-is, and the search layer releases the response object bytes immediately after hand-off.
…c#143136) - The fetch phase builds the search hits and reserves circuit-breaker bytes for them. - On the network path, the response is serialized into a circuit-breaker-aware byte stream. At this point, both the response objects and the serialized bytes are on the heap and tracked by the breaker. - As soon as serialization completes, the search layer releases the response object bytes in a finally block. The heavy Java objects are no longer needed and can be garbage-collected. - The serialized bytes remain tracked on the breaker while they sit in Netty's write queue. When Netty finishes writing and drops its reference, the pages are freed and the breaker is decremented. - On the direct (same-node) path, no serialization happen, the response is forwarded as-is, and the search layer releases the response object bytes immediately after hand-off.
Problem
When a search response is ready,
SearchServiceserializes it and hands it to Netty for writing to the network. The problem is that Netty'swriteAndFlushqueues the bytes into an internal buffer and returns; the data hasn't left the JVM yet.In the current implementation, the circuit breaker reservation is released right after this synchronous return. Under normal conditions, the gap is tiny, but when the network is slow, or many responses are in flight at the same time, bytes pile up in Netty's write queue while the breaker thinks the memory is free. This lets the node accept more work than it can handle, potentially leading to OOM.
The breaker tracks the search hit objects, not the serialized bytes. Those objects are freed (via decRef) shortly after serialization. However, a serialized copy of that same data still sits in Netty's buffer, consuming heap until the write finishes.
What changed