Skip to content

Add indexing pressure tracking to OTLP endpoints#144009

Merged
felixbarny merged 7 commits intoelastic:mainfrom
felixbarny:otlp-indexing-pressure
Mar 13, 2026
Merged

Add indexing pressure tracking to OTLP endpoints#144009
felixbarny merged 7 commits intoelastic:mainfrom
felixbarny:otlp-indexing-pressure

Conversation

@felixbarny
Copy link
Copy Markdown
Member

Adds indexing pressure tracking to OTLP REST endpoints so that protobuf request bodies are accumulated under memory backpressure, matching the existing behavior for Prometheus remote write.

AbstractOTLPRestAction.prepareRequest now returns an IndexingPressureAwareContentAggregator that streams the request body chunk-by-chunk. The aggregator reserves coordinating memory up front (bounded by http.max_protobuf_content_length) and rejects oversized bodies with 413 or overloaded nodes with 429 before buffering the full payload. On failure, the response uses the correct application/x-protobuf content type with an empty body, so OTLP clients can rely on the HTTP status code for retry decisions.

The IndexingPressureAwareContentAggregator itself was refactored to own the full reservation lifecycle: it now accepts IndexingPressure instead of a pre-acquired Coordinating handle and performs the reservation inside accept(). If the reservation is rejected (EsRejectedExecutionException), the onFailure callback is invoked rather than letting the exception escape into the generic REST handler infrastructure. This ensures both 413 and 429 errors go through the same callback, giving callers full control over the error response format. This benefits both OTLP and Prometheus remote write endpoints.

Added AbstractOTLPRestActionTests / OTLPMetricsRestActionTests covering the success path, empty body handling, transport action failure, oversized body rejection (413), and indexing pressure rejection (429). The abstract test class mirrors AbstractOTLPTransportActionTests to make adding tests for future signals straightforward.

Wire IndexingPressure into AbstractOTLPRestAction so protobuf
request bodies are streamed and accumulated under coordinating
memory pressure, bounded by http.max_protobuf_content_length.

Refactor IndexingPressureAwareContentAggregator to own the full
reservation lifecycle: it now takes IndexingPressure directly and
performs the reservation in accept(), routing failures through the
onFailure callback instead of throwing. This ensures both 413
(body too large) and 429 (pressure rejected) produce
format-appropriate responses for OTLP and Prometheus endpoints.

Add AbstractOTLPRestActionTests with coverage for success, empty
body, transport failure, 413, and 429 paths.
@felixbarny felixbarny requested a review from DaveCTurner March 11, 2026 11:10
@felixbarny felixbarny self-assigned this Mar 11, 2026
@felixbarny felixbarny requested a review from a team as a code owner March 11, 2026 11:10
@felixbarny felixbarny added the :StorageEngine/TSDB You know, for Metrics label Mar 11, 2026
@elasticsearchmachine elasticsearchmachine added Team:StorageEngine v9.4.0 external-contributor Pull request authored by a developer outside the Elasticsearch team labels Mar 11, 2026
@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

Pinging @elastic/es-storage-engine (Team:StorageEngine)

@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

Hi @felixbarny, I've created a changelog YAML for you.

@felixbarny
Copy link
Copy Markdown
Member Author

I confirmed in the benchmarking environment that there are no regressions in terms of ingest throughput related to this change.

@sidosera
Copy link
Copy Markdown
Contributor

LGTM, accepting to unblock. Let's make sure @DaveCTurner review this too.

Copy link
Copy Markdown
Member

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, accepting to unblock. Let's make sure @DaveCTurner review this too.

I don't understand the purpose of this. One approval is enough to merge, so if you don't think your own review is enough, please don't make it an approving review.

I will take a deeper look this afternoon.

Pass BodyPostProcessor.NOOP to IndexingPressureAwareContentAggregator
where the new parameter was added in main but not yet in callers on
this branch.
Copy link
Copy Markdown
Member

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Production changes all look good. However, given the importance of having the right response codes on overload (413 vs 429) I would prefer that we cover these cases in a javaRestTest rather than just in AbstractOTLPRestActionTests. I suggest adjusting AbstractOTLPIndexingRestIT to set indexing_pressure.memory.coordinating.limit to something less than 8MiB on the test node and then verify that requests which exceed the limit get a 429 and ones which exceed 8MiB get a 413.

@felixbarny
Copy link
Copy Markdown
Member Author

Thanks for the suggestion, David. I've added tests now. PTAL.

Copy link
Copy Markdown
Member

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@felixbarny felixbarny merged commit 3e5eed8 into elastic:main Mar 13, 2026
36 checks passed
@felixbarny felixbarny deleted the otlp-indexing-pressure branch March 13, 2026 15:39
szybia added a commit to szybia/elasticsearch that referenced this pull request Mar 13, 2026
…elocations

* upstream/main: (72 commits)
  [Test] Randomly disable sequence numbers in CcrTimeSeriesDataStreamsIT (elastic#143930)
  Fix AsyncSearchIndexServiceTests.testCircuitBreaker failure (elastic#144058)
  Refine GenerativeIT some more, this time with accounting for some added (elastic#144220)
  ESQL: Physical Planning on the Lookup Node (elastic#143707)
  Mute org.elasticsearch.xpack.esql.CsvIT test {csv-spec:approximation.Approximate stats by with zero variance} elastic#144240
  Trigger counter metrics in test for delta temporality measurements (elastic#144193)
  fix capabiltiy approximation_v3 (elastic#144230)
  [ci] Add PR pipeline for testing ipv6 and fix tests not working with ipv6 (elastic#140473)
  update (elastic#144095)
  Make from/to optional in TBUCKET when Kibana timestamp filter is present (elastic#144057)
  Extract reroute behavior from create-index request classes (elastic#144140)
  ESQL: Fix release build only failures (elastic#144122)
  ES|QL query approximation: move sample correction to data node (elastic#144005)
  Add indexing pressure tracking to OTLP endpoints (elastic#144009)
  Fix replica writes after _seq_no doc values are pruned (elastic#144180)
  allow tests to configure supportsLoadingConfig (elastic#144061)
  [ES|QL] Unmute testGiantTextFieldInSubqueryIntermediateResultsWithSort (elastic#144126)
  [ESQL][DOCS] Add CPS page (unpublished for moment) (elastic#144206)
  ESQL: Forbid "load" unmapped_fields for certain commands (elastic#144115)
  Add CCS Remote Views Detection (elastic#143384)
  ...
ncordon pushed a commit to ncordon/elasticsearch that referenced this pull request Mar 16, 2026
Wire IndexingPressure into AbstractOTLPRestAction so protobuf
request bodies are streamed and accumulated under coordinating
memory pressure, bounded by http.max_protobuf_content_length.

Refactor IndexingPressureAwareContentAggregator to own the full
reservation lifecycle: it now takes IndexingPressure directly and
performs the reservation in accept(), routing failures through the
onFailure callback instead of throwing. This ensures both 413
(body too large) and 429 (pressure rejected) produce
format-appropriate responses for OTLP and Prometheus endpoints.

Add AbstractOTLPRestActionTests with coverage for success, empty
body, transport failure, 413, and 429 paths.
michalborek pushed a commit to michalborek/elasticsearch that referenced this pull request Mar 23, 2026
Wire IndexingPressure into AbstractOTLPRestAction so protobuf
request bodies are streamed and accumulated under coordinating
memory pressure, bounded by http.max_protobuf_content_length.

Refactor IndexingPressureAwareContentAggregator to own the full
reservation lifecycle: it now takes IndexingPressure directly and
performs the reservation in accept(), routing failures through the
onFailure callback instead of throwing. This ensures both 413
(body too large) and 429 (pressure rejected) produce
format-appropriate responses for OTLP and Prometheus endpoints.

Add AbstractOTLPRestActionTests with coverage for success, empty
body, transport failure, 413, and 429 paths.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

>enhancement external-contributor Pull request authored by a developer outside the Elasticsearch team :StorageEngine/TSDB You know, for Metrics Team:StorageEngine v9.4.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants