Add indexing pressure tracking to OTLP endpoints#144009
Add indexing pressure tracking to OTLP endpoints#144009felixbarny merged 7 commits intoelastic:mainfrom
Conversation
Wire IndexingPressure into AbstractOTLPRestAction so protobuf request bodies are streamed and accumulated under coordinating memory pressure, bounded by http.max_protobuf_content_length. Refactor IndexingPressureAwareContentAggregator to own the full reservation lifecycle: it now takes IndexingPressure directly and performs the reservation in accept(), routing failures through the onFailure callback instead of throwing. This ensures both 413 (body too large) and 429 (pressure rejected) produce format-appropriate responses for OTLP and Prometheus endpoints. Add AbstractOTLPRestActionTests with coverage for success, empty body, transport failure, 413, and 429 paths.
|
Pinging @elastic/es-storage-engine (Team:StorageEngine) |
|
Hi @felixbarny, I've created a changelog YAML for you. |
|
I confirmed in the benchmarking environment that there are no regressions in terms of ingest throughput related to this change. |
...in/otel-data/src/main/java/org/elasticsearch/xpack/oteldata/otlp/AbstractOTLPRestAction.java
Show resolved
Hide resolved
|
LGTM, accepting to unblock. Let's make sure @DaveCTurner review this too. |
DaveCTurner
left a comment
There was a problem hiding this comment.
LGTM, accepting to unblock. Let's make sure @DaveCTurner review this too.
I don't understand the purpose of this. One approval is enough to merge, so if you don't think your own review is enough, please don't make it an approving review.
I will take a deeper look this afternoon.
Pass BodyPostProcessor.NOOP to IndexingPressureAwareContentAggregator where the new parameter was added in main but not yet in callers on this branch.
DaveCTurner
left a comment
There was a problem hiding this comment.
Production changes all look good. However, given the importance of having the right response codes on overload (413 vs 429) I would prefer that we cover these cases in a javaRestTest rather than just in AbstractOTLPRestActionTests. I suggest adjusting AbstractOTLPIndexingRestIT to set indexing_pressure.memory.coordinating.limit to something less than 8MiB on the test node and then verify that requests which exceed the limit get a 429 and ones which exceed 8MiB get a 413.
|
Thanks for the suggestion, David. I've added tests now. PTAL. |
…elocations * upstream/main: (72 commits) [Test] Randomly disable sequence numbers in CcrTimeSeriesDataStreamsIT (elastic#143930) Fix AsyncSearchIndexServiceTests.testCircuitBreaker failure (elastic#144058) Refine GenerativeIT some more, this time with accounting for some added (elastic#144220) ESQL: Physical Planning on the Lookup Node (elastic#143707) Mute org.elasticsearch.xpack.esql.CsvIT test {csv-spec:approximation.Approximate stats by with zero variance} elastic#144240 Trigger counter metrics in test for delta temporality measurements (elastic#144193) fix capabiltiy approximation_v3 (elastic#144230) [ci] Add PR pipeline for testing ipv6 and fix tests not working with ipv6 (elastic#140473) update (elastic#144095) Make from/to optional in TBUCKET when Kibana timestamp filter is present (elastic#144057) Extract reroute behavior from create-index request classes (elastic#144140) ESQL: Fix release build only failures (elastic#144122) ES|QL query approximation: move sample correction to data node (elastic#144005) Add indexing pressure tracking to OTLP endpoints (elastic#144009) Fix replica writes after _seq_no doc values are pruned (elastic#144180) allow tests to configure supportsLoadingConfig (elastic#144061) [ES|QL] Unmute testGiantTextFieldInSubqueryIntermediateResultsWithSort (elastic#144126) [ESQL][DOCS] Add CPS page (unpublished for moment) (elastic#144206) ESQL: Forbid "load" unmapped_fields for certain commands (elastic#144115) Add CCS Remote Views Detection (elastic#143384) ...
Wire IndexingPressure into AbstractOTLPRestAction so protobuf request bodies are streamed and accumulated under coordinating memory pressure, bounded by http.max_protobuf_content_length. Refactor IndexingPressureAwareContentAggregator to own the full reservation lifecycle: it now takes IndexingPressure directly and performs the reservation in accept(), routing failures through the onFailure callback instead of throwing. This ensures both 413 (body too large) and 429 (pressure rejected) produce format-appropriate responses for OTLP and Prometheus endpoints. Add AbstractOTLPRestActionTests with coverage for success, empty body, transport failure, 413, and 429 paths.
Wire IndexingPressure into AbstractOTLPRestAction so protobuf request bodies are streamed and accumulated under coordinating memory pressure, bounded by http.max_protobuf_content_length. Refactor IndexingPressureAwareContentAggregator to own the full reservation lifecycle: it now takes IndexingPressure directly and performs the reservation in accept(), routing failures through the onFailure callback instead of throwing. This ensures both 413 (body too large) and 429 (pressure rejected) produce format-appropriate responses for OTLP and Prometheus endpoints. Add AbstractOTLPRestActionTests with coverage for success, empty body, transport failure, 413, and 429 paths.
Adds indexing pressure tracking to OTLP REST endpoints so that protobuf request bodies are accumulated under memory backpressure, matching the existing behavior for Prometheus remote write.
AbstractOTLPRestAction.prepareRequestnow returns anIndexingPressureAwareContentAggregatorthat streams the request body chunk-by-chunk. The aggregator reserves coordinating memory up front (bounded byhttp.max_protobuf_content_length) and rejects oversized bodies with 413 or overloaded nodes with 429 before buffering the full payload. On failure, the response uses the correctapplication/x-protobufcontent type with an empty body, so OTLP clients can rely on the HTTP status code for retry decisions.The
IndexingPressureAwareContentAggregatoritself was refactored to own the full reservation lifecycle: it now acceptsIndexingPressureinstead of a pre-acquiredCoordinatinghandle and performs the reservation insideaccept(). If the reservation is rejected (EsRejectedExecutionException), theonFailurecallback is invoked rather than letting the exception escape into the generic REST handler infrastructure. This ensures both 413 and 429 errors go through the same callback, giving callers full control over the error response format. This benefits both OTLP and Prometheus remote write endpoints.Added
AbstractOTLPRestActionTests/OTLPMetricsRestActionTestscovering the success path, empty body handling, transport action failure, oversized body rejection (413), and indexing pressure rejection (429). The abstract test class mirrorsAbstractOTLPTransportActionTeststo make adding tests for future signals straightforward.