Implement chunked fetch streaming with circuit breaker integration #139124

Merged

drempapis merged 447 commits into elastic:main from drempapis:chunked_fetch_phase on Mar 27, 2026

Conversation

@drempapis (Contributor) commented Dec 5, 2025

In the current implementation, when Elasticsearch executes a search query that returns a large number of documents, the fetch phase retrieves the actual document content from each shard, which can lead to significant memory pressure on data nodes.

  • Data Node
    • All SearchHit objects are built and held in memory simultaneously before being serialized and sent to the coordinator. For large result sets (e.g., 1000 or more documents with nested fields), this can consume gigabytes of heap memory.
  • Transport
    • Big messages are transferred through the network.
  • Coordinator Node
    • Receives the complete response from each shard at once, accumulating all hits in memory before building the final response. With multiple shards, memory usage multiplies even for one query.
  • Result
    • OutOfMemoryError (OOM) crashes, especially during concurrent large queries or when document sizes are unpredictable.

This PR implements chunked streaming for the fetch phase to reduce memory pressure when handling large result sets. Instead of accumulating all search hits in memory on the data node before sending them to the coordinator, hits are streamed in configurable chunks (default: 256 KB) as they are produced. Memory usage is bounded by circuit breakers on both the data and coordinator nodes.
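The chunking idea can be sketched independently of the Elasticsearch internals: serialize each hit, append the bytes to a buffer, and emit a chunk as soon as the buffer crosses the threshold. The names below (`ChunkBuffer`, `CHUNK_SIZE_BYTES`) are illustrative, not the PR's actual identifiers.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of byte-based chunking: hits are serialized one at a
// time and a chunk is emitted as soon as the buffer crosses the threshold,
// so the peak buffer size is bounded regardless of how many hits there are.
public class ChunkBuffer {
    public static final int CHUNK_SIZE_BYTES = 256 * 1024; // default 256 KB

    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    private final List<byte[]> emitted = new ArrayList<>();

    public void append(byte[] serializedHit) {
        buffer.writeBytes(serializedHit);
        if (buffer.size() >= CHUNK_SIZE_BYTES) {
            flush(); // emit and reset; the SearchHit object is already released
        }
    }

    public void flush() {
        if (buffer.size() > 0) {
            emitted.add(buffer.toByteArray());
            buffer.reset();
        }
    }

    public List<byte[]> chunks() {
        return emitted;
    }
}
```

A trailing `flush()` after the last hit sends any partial chunk, so the final chunk may be smaller than the threshold.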

How OOM is Prevented on the Data Node

  • Immediate Serialization
    • Each SearchHit is serialized to bytes immediately after fetching, then the object is released. The bytes are enqueued in chunks for processing.
  • Byte-Based Chunking (default 256 KB)
    • Chunks are emitted when the serialized bytes exceed the 256 KB threshold. This bounds the maximum buffer size regardless of document count or size.
  • Circuit Breaker Reservation
    • Before each chunk is enqueued for sending, memory is reserved via CB.addEstimateBytesAndMaybeBreak(). If the breaker trips (too much memory), the operation fails fast with CircuitBreakingException instead of OOM.
    • Circuit breaker memory accounting is more accurate in this implementation. It tracks the full serialized SearchHit size (including all fields, metadata, and nested structures), whereas the traditional implementation only accounts for the _source field bytes.
  • ThrottledTaskRunner Backpressure
    • Limits concurrent in-flight chunks to maxInFlightChunks. When at capacity, new chunks queue internally. This prevents unbounded chunk accumulation when the coordinator is slow.
  • ACK-Based Memory Release
    • Circuit breaker memory is released only when the coordinator ACKs each chunk. This creates natural backpressure: if the coordinator is slow, data-node memory stays high, eventually tripping the circuit breaker.
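The reserve-before-send / release-on-ACK accounting above can be sketched with a simplified stand-in for the breaker. The real code goes through `CircuitBreaker.addEstimateBytesAndMaybeBreak()`; `ChunkBreaker` here is a hypothetical approximation, not the PR's class.

```java
import java.util.concurrent.atomic.AtomicLong;

// Simplified stand-in for the circuit-breaker accounting described above:
// memory is reserved before a chunk is enqueued and released only when the
// coordinator ACKs it. If a reservation would exceed the limit, the send
// fails fast instead of risking an OutOfMemoryError.
public class ChunkBreaker {
    private final long limitBytes;
    private final AtomicLong used = new AtomicLong();

    public ChunkBreaker(long limitBytes) {
        this.limitBytes = limitBytes;
    }

    /** Reserve bytes for a chunk before it is enqueued for sending. */
    public void reserve(long bytes) {
        long newUsed = used.addAndGet(bytes);
        if (newUsed > limitBytes) {
            used.addAndGet(-bytes); // roll back the failed reservation
            throw new IllegalStateException("circuit breaker tripped at " + newUsed + " bytes");
        }
    }

    /** Called from the ACK handler once the coordinator consumed the chunk. */
    public void onAck(long bytes) {
        used.addAndGet(-bytes);
    }

    public long usedBytes() {
        return used.get();
    }
}
```

Because release happens only in `onAck`, a slow coordinator keeps `usedBytes` high until the next reservation trips, which is exactly the backpressure behaviour the PR describes.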

How OOM is Prevented on the Coordinator Node

  • Incremental Chunk Reception
    • Instead of receiving all hits at once, the coordinator receives small chunks (roughly 256 KB each). Memory grows incrementally as chunks arrive.
  • Circuit Breaker Tracking
    • FetchPhaseResponseStream tracks accumulated bytes and reserves memory on the coordinator's circuit breaker (for all shards). If the breaker trips, new chunks are rejected.
  • ACK Flow Control
    • The coordinator only ACKs a chunk after successfully processing it. If the coordinator is overwhelmed, it stops ACKing, which throttles the data node via backpressure.
  • Cleanup on Failure
    • If any error occurs, closeInternal() releases all circuit breaker bytes and cleans up accumulated hits, preventing memory leaks.
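The coordinator-side bookkeeping above (reserve per chunk, withhold the ACK when over limit, release everything on cleanup) can be sketched as follows. The real class is `FetchPhaseResponseStream`; `ChunkReceiver` and its methods are hypothetical names for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the coordinator-side flow described above: each
// incoming chunk reserves breaker bytes before being buffered, an ACK is
// returned only on success, and close() releases everything so no
// circuit-breaker bytes leak after a failure.
public class ChunkReceiver implements AutoCloseable {
    private final long breakerLimit;
    private long reserved = 0;
    private final List<byte[]> chunks = new ArrayList<>();

    public ChunkReceiver(long breakerLimit) {
        this.breakerLimit = breakerLimit;
    }

    /** Returns true (an ACK) only if the chunk fit under the breaker limit. */
    public boolean onChunk(byte[] chunk) {
        if (reserved + chunk.length > breakerLimit) {
            return false; // no ACK is sent, which throttles the data node
        }
        reserved += chunk.length;
        chunks.add(chunk);
        return true;
    }

    public long reservedBytes() {
        return reserved;
    }

    @Override
    public void close() {
        chunks.clear(); // drop accumulated hits
        reserved = 0;   // release all circuit-breaker bytes
    }
}
```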

Flow Diagram

[flow diagram image]

The implementation follows the pattern of TransportRepositoryVerifyIntegrityCoordinationAction, but streams only between the coordinator and the data nodes.

@DaveCTurner (Member) left a comment

I like it :)

@drempapis added labels Dec 11, 2025: Team:Search Foundations (Meta label for the Search Foundations team in Elasticsearch), :Search Foundations/Search (Catch all for Search Foundations), >refactoring
@drempapis (Contributor, Author)

@elasticmachine run elasticsearch-ci/part-2

@DaveCTurner (Member) left a comment

Couple of thoughts about blocking of threads.

@drempapis (Contributor, Author)

@elasticmachine run elasticsearch-ci/part-1

@drempapis (Contributor, Author)

@elasticmachine run elasticsearch-ci/part-2

return false;
}

nextChunk = queue.poll(100, TimeUnit.MILLISECONDS);
@DaveCTurner (Member) left a comment

Ok this is definitely better because at least it's only blocking a thread while fetching the docs locally, but now we need two threads.

I'm partly to blame for suggesting a ThrottledIterator here. That would have worked if we could have moved the fetch-from-disk process between threads but it doesn't fit here given the single-threadedness constraint. I think instead we need a new ThrottledTaskRunner("fetch", maxInFlightChunks, EsExecutors.DIRECT_EXECUTOR_SERVICE) to manage the queue.

@drempapis (Contributor, Author) replied

@DaveCTurner thank you for the feedback!

I want to make sure I understand correctly. When you say use ThrottledTaskRunner with DIRECT_EXECUTOR_SERVICE, do you mean

  1. Eliminate the producer-consumer pattern entirely and have the Lucene thread enqueue send tasks directly to ThrottledTaskRunner, which runs them inline when under capacity
  2. Keep the producer-consumer pattern, but replace ThrottledIterator with ThrottledTaskRunner on the consumer

@drempapis (Contributor, Author) replied

I've updated the implementation to use ThrottledTaskRunner

The iterateAsync method now uses a single ThrottledTaskRunner("fetch", maxInFlightChunks, DIRECT_EXECUTOR_SERVICE) to manage chunk sends.

  • The calling thread fetches documents sequentially, serializes hits into chunks, and enqueues send tasks directly to the ThrottledTaskRunner
  • Tasks run immediately on the calling thread via DIRECT_EXECUTOR_SERVICE when fewer than maxInFlightChunks are in flight
  • When at the limit, tasks queue internally until ACK callbacks signal completion, which triggers queued tasks

This is better than the custom producer/consumer implementation. No thread blocks waiting for network I/O, and the producer thread is freed immediately after enqueueing, while memory usage is throttled by the circuit breaker on the data nodes to protect against OOM.
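The enqueue-inline-or-queue semantics described above can be sketched single-threadedly. This mirrors the behaviour of a `ThrottledTaskRunner` with a direct executor only loosely (a real one is thread-safe and lives in Elasticsearch's concurrency utilities); `ThrottledSender` and its methods are illustrative names.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Minimal single-threaded sketch of the throttling pattern described above:
// up to maxInFlight send tasks run inline on the calling thread (the
// direct-executor case); beyond that, tasks queue internally until an ACK
// callback hands the freed slot to the next queued task.
public class ThrottledSender {
    private final int maxInFlight;
    private int inFlight = 0;
    private final Deque<Runnable> queued = new ArrayDeque<>();

    public ThrottledSender(int maxInFlight) {
        this.maxInFlight = maxInFlight;
    }

    /** Called by the producing (Lucene) thread for each chunk-send task. */
    public void enqueue(Runnable sendTask) {
        if (inFlight < maxInFlight) {
            inFlight++;
            sendTask.run(); // runs inline on the calling thread
        } else {
            queued.add(sendTask); // caller returns immediately; no blocking
        }
    }

    /** Called by the ACK callback when the coordinator confirms a chunk. */
    public void onAck() {
        Runnable next = queued.poll();
        if (next != null) {
            next.run(); // freed slot handed directly to the next queued task
        } else {
            inFlight--;
        }
    }

    public int queuedCount() {
        return queued.size();
    }
}
```

Note how no thread ever blocks: at capacity, `enqueue` just queues and returns, and progress resumes purely from ACK callbacks.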

@drempapis (Contributor, Author)

@elasticmachine run elasticsearch-ci/part-2

@drempapis (Contributor, Author)

Buildkite benchmark this with noaa-3n-2g please

@drempapis (Contributor, Author)

Buildkite benchmark this with esql please

@drempapis (Contributor, Author)

Buildkite benchmark this with geoshape please

@elasticmachine (Collaborator) commented Mar 22, 2026

💔 Build Failed

Failed CI Steps

This build attempts two geoshape benchmarks to evaluate performance impact of this PR. To estimate benchmark completion time inspect previous nightly runs here.

History

@drempapis drempapis merged commit a7e2068 into elastic:main Mar 27, 2026
36 checks passed
mamazzol pushed a commit to mamazzol/elasticsearch that referenced this pull request Mar 30, 2026

Labels

>refactoring :Search Foundations/Search Catch all for Search Foundations Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch v9.4.0


6 participants