
[performance] continuous polling#256564

Merged
drewdaemon merged 42 commits into elastic:main from drewdaemon:229903/continuous-polling
Apr 10, 2026

Conversation

@drewdaemon
Contributor

@drewdaemon drewdaemon commented Mar 6, 2026

Summary

Close #229903
Close #186145

This PR implements two key (async) search performance optimizations related to polling.

  1. When the browser is using a protocol that supports multiplexing, Kibana-side sleeps are eliminated and long-polling is used, ensuring results are delivered as soon as possible.
  2. One of the Elasticsearch requests that used to happen after polling has been removed.

Reviewer notes

`retrieveResults` has been renamed to `returnIntermediateResults` to match the Elasticsearch parameter name, but should be functionally identical.

Settings behavior

wait_for_completion_timeout

  flowchart TD
      Start[Start] --> Phase{Phase?}

      Phase -->|Initial Submit| SubmitConfig[Use search.asyncSearch.waitForCompletion]

      Phase -->|Polling GET| ClientConfig{search.asyncSearch.pollLength set?}

      ClientConfig -->|Yes - Number| UseClientConfig[Use config value]
      ClientConfig -->|No| Multiplex{Is the browser using HTTP/2 or<br/>HTTP/3?}

      Multiplex -->|Yes| Use30s[Use 30000ms]
      Multiplex -->|No| Undefined[Omitted - functionally zero]
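
The decision above can be sketched roughly as follows. This is illustrative only: the names `AsyncSearchConfig` and `getWaitForCompletionTimeout` are assumptions, not the actual Kibana source.

```typescript
// Illustrative sketch of the wait_for_completion_timeout decision; the
// real Kibana implementation may differ in names and signatures.
interface AsyncSearchConfig {
  waitForCompletion?: number; // ms, used on the initial submit
  pollLength?: number; // ms, optional explicit override for polling GETs
}

function getWaitForCompletionTimeout(
  phase: 'submit' | 'poll',
  config: AsyncSearchConfig,
  supportsMultiplexing: boolean
): number | undefined {
  if (phase === 'submit') return config.waitForCompletion;
  // An explicit search.asyncSearch.pollLength setting always wins
  if (config.pollLength !== undefined) return config.pollLength;
  // Long-poll for up to 30s over HTTP/2 or HTTP/3; otherwise omit the
  // parameter, which is functionally a zero timeout
  return supportsMultiplexing ? 30000 : undefined;
}
```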

pollInterval

  flowchart TD
      Start[Start] --> ConfigSet{search.asyncSearch.pollInterval set?}

      ConfigSet -->|Yes| UseConfig[Use config value]
      ConfigSet -->|No| CheckMultiplex{HTTP/2 or<br/>HTTP/3?}

      CheckMultiplex -->|Yes| UseZero[Use 0ms]
      CheckMultiplex -->|No| CheckStatic{Static value<br/>provided?}

      CheckStatic -->|Yes| UseStatic[Use that value]
      CheckStatic -->|No| ElapsedTime{Elapsed time?}

      ElapsedTime -->|< 1.5s| Use300[300ms]
      ElapsedTime -->|< 5s| Use1000[1000ms]
      ElapsedTime -->|< 20s| Use2500[2500ms]
      ElapsedTime -->|>= 20s| Use5000[5000ms]
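
The same decision in sketch form. The parameter list here is an assumption for illustration; the actual `getPollInterval` signature in the source may differ.

```typescript
// Illustrative sketch of the pollInterval decision; parameter names and
// ordering are assumptions, not the actual Kibana signature.
function getPollInterval(
  elapsedMs: number,
  configValue?: number, // search.asyncSearch.pollInterval, if set
  supportsMultiplexing = false, // HTTP/2 or HTTP/3 detected
  staticValue?: number // caller-provided static interval
): number {
  if (configValue !== undefined) return configValue; // explicit setting always wins
  if (supportsMultiplexing) return 0; // long-polling: no client-side sleep
  if (staticValue !== undefined) return staticValue;
  // Back off as the search runs longer
  if (elapsedMs < 1500) return 300;
  if (elapsedMs < 5000) return 1000;
  if (elapsedMs < 20000) return 2500;
  return 5000;
}
```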

Checklist

  • Documentation was added for features that require explanation or tutorials
  • Unit or functional tests were updated or added to match the most common scenarios
  • The PR description includes the appropriate Release Notes section, and the correct release_note:* label is applied per the guidelines

Identify risks

The main risk is that an on-prem deployment with HTTP/2 enabled for browser communication may see timeout errors after upgrading if it

  • has set elasticsearch.idleSocketTimeout or server.socketTimeout to a value less than 30 seconds, or
  • uses a proxy with a timeout configured at less than 30 seconds.

In either case, the user can fix the behavior by raising the interfering timeouts (preferred) or by tuning down the poll length via kibana.yml.

Release note

Sped up fetching Elasticsearch data for setups using HTTP/2 as the browser's communication protocol.

: {
...(await getIgnoreThrottled(uiSettingsClient)),
...defaultParams,
...getCommonDefaultAsyncGetParams(searchConfig, options, {
Contributor Author

Using the Get method appears to have been a mistake/oversight.

@davismcphee
Contributor

I think this one deserves two sets of eyes on our end. Requesting reviews from both @AlexGPlay and @lukasolson.

Contributor

@stratoula stratoula left a comment


ES|QL changes LGTM!

(what a lovely change, but I agree with Davis it would need some testing. Let me know if you want me to test too, I only did a code review)

Contributor

@iblancof iblancof left a comment


Obs-exploration code changes LGTM.
The only change is the renaming of retrieveResults to returnIntermediateResults.

 * setting this to `true` will request the search results, regardless of whether or not the search is complete.
 */
- retrieveResults?: boolean;
+ returnIntermediateResults?: boolean;
Contributor

does this mean that we can get incremental results while the search is in progress?

Contributor Author

nit: I know the Elasticsearch naming could be taken either way, but we are really talking about "partial results," not "incremental results."

But to your question, there should be no practical change in the behavior from the old retrieveResults param. retrieveResults has always meant, "retrieve all results that are available whether or not the search is complete."

The change is that instead of switching endpoints from the status to the GET endpoint when returnIntermediateResults is set, we control the behavior with a param to the GET endpoint.

@kertal

This comment was marked as outdated.

@drewdaemon
Contributor Author

@kertal nice to see an early validation with at least some modest gains. The gains are going to be very dependent on the original search duration.

sawtooth

@kertal

This comment was marked as duplicate.

@kertal
Member

kertal commented Apr 8, 2026

> @kertal nice to see an early validation with at least some modest gains. The gains are going to be very dependent on the original search duration.

Yes, I agree the real gain needs a distribution of various search durations, which can't be simulated with a static scenario like the given one.

Update: It can't be simulated, unless you ask AI to generate a few dashboards with increasing stall times: for a 21s stalled search the gain seems to be 4.8s 🎉, for a 22s one it's 0.8s (pending numbers, needs to run multiple times).

Update 2: This is how it looks when loading 23 dashboards in a row with increasing query time (by increasing the stalling value on the ES filter from 1s to 23s); the first part of the video is sped up for better dramatization. It's not very exciting to watch the same dashboard loading for several minutes 😆. However, the result at the end is exciting, showing that you get more than one long-running dashboard for free. So with polling there's a speed gain of half a minute in this case 🥳

kibana-data-polling.mp4

Here are the numbers, in the format {ES query time}:{dashboard render time gain}:
1s:0.074s, 2s:0.474s, 3s:0.510s, 4s:0.645s, 5s:0.688s, 6s:1.982s, 7s:1.185s, 8s:0.178s, 9s:1.300s, 10s:0.711s, 11s:2.202s, 12s:1.261s, 13s:0.332s, 14s:1.897s, 15s:1.877s, 16s:2.224s, 17s:1.240s, 18s:0.253s, 19s:1.909s, 20s:0.685s, 21s:4.864s, 22s:4.077s, 23s:2.918s

Comment on lines +73 to +75
return isRunningResponse(response)
? timer(getPollInterval(elapsedTime)).pipe(switchMap(() => search()))
: EMPTY;
Contributor

Curious as to why we use this vs. takeWhile?

Contributor Author

When I made the changes in this PR, the previous setup was making an extra request at the end, after the results were already available. But I'm not an rxjs guru, so I'm open to suggestions.
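
For reference, an async/await sketch of the behavior being aimed for here (not the actual rxjs implementation quoted above): the request is only re-issued while the response still reports running, so no trailing request fires after the results arrive.

```typescript
// Sketch only: the real code uses rxjs (timer/switchMap as quoted above),
// but the invariant is the same: re-poll only while still running.
interface PollResponse {
  isRunning: boolean;
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function pollUntilComplete<T extends PollResponse>(
  search: () => Promise<T>,
  getPollInterval: (elapsedMs: number) => number
): Promise<T> {
  const start = Date.now();
  let response = await search();
  while (response.isRunning) {
    await sleep(getPollInterval(Date.now() - start));
    // Only re-issued while still running, so the final response is the
    // last request made: no extra request after completion
    response = await search();
  }
  return response;
}
```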

Comment on lines +177 to +182
entries.forEach((entry) => {
if (entry.name.includes('/internal/search/')) {
this.protocolSupportsMultiplexing = ['h2', 'h3'].includes(entry.nextHopProtocol);
this.performanceObserver?.disconnect(); // We only need to detect this once, so we can disconnect the observer after the first match
}
});
Contributor

Nit: Is there any reason we need to continue looping through the array after we've found an appropriate entry?

Suggested change (replace the forEach with a find):

    const entry = entries.find(({ name }) => name.includes('/internal/search/'));
    if (entry) {
      this.protocolSupportsMultiplexing = ['h2', 'h3'].includes(entry.nextHopProtocol);
      this.performanceObserver?.disconnect(); // We only need to detect this once, so we can disconnect the observer after the first match
    }


// Preserve and project first request params into responses.
let firstRequestParams: SanitizedConnectionRequestParams;

const pollInterval = this.deps.searchConfig.asyncSearch.pollInterval
Contributor

So this always uses the pollInterval from configuration (if it's configured), even if the protocol supports multiplexing?

Contributor Author

yeah, the user's explicit settings are always respected

@drewdaemon drewdaemon removed the request for review from a team April 9, 2026 16:11
Contributor

@lukasolson lukasolson left a comment


Hmm, when running in http2 mode, background search seems to behave a little funky:

Screen.Recording.2026-04-09.at.10.33.15.AM.mov

Here's the flow:

  1. Start a long query
  2. The first response comes back with the search ID
  3. Click the "send to background" button
  4. A new request goes out with that search ID to attach it to the background search
  5. We wait for that response before showing the notification (which ends up being when the search completes)

The request from step 4 ends up using the same wait_for_completion_timeout as the main request, so it doesn't complete until the search is complete. We probably need to send a shorter wait_for_completion_timeout for that request specifically (as far as I'm aware, we don't care about the results in that response, just that it gets attached to the background search saved object).
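
One possible shape of that fix, sketched with hypothetical names (not the actual change): the attach request overrides the long-poll timeout with a short one, since it only needs to confirm the attachment.

```typescript
// Hypothetical sketch: when attaching an in-progress search to a
// background search, don't reuse the long wait_for_completion_timeout of
// the main polling request. All names here are illustrative.
interface AsyncGetParams {
  wait_for_completion_timeout?: string;
  keep_alive?: string;
}

function getAttachToBackgroundParams(defaults: AsyncGetParams): AsyncGetParams {
  return {
    ...defaults,
    // A short timeout: we only need confirmation that the search was
    // attached to the background search saved object, not its results
    wait_for_completion_timeout: '100ms',
  };
}
```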

@drewdaemon drewdaemon requested a review from lukasolson April 9, 2026 18:19
@drewdaemon
Contributor Author

drewdaemon commented Apr 9, 2026

@lukasolson great catch. Thought I'd tested background searches but it must have slipped through the cracks. See what you think about 3a36579 and fa87041

Contributor

@lukasolson lukasolson left a comment


Love this change! LGTM. I did notice another odd behavior while testing but it's unrelated to these changes: #262384

@elasticmachine
Contributor

💚 Build Succeeded

Metrics [docs]

Public APIs missing comments

Total count of every public API that lacks a comment. Target amount is 0. Run `node scripts/build_api_docs --plugin [yourplugin] --stats comments` for more detailed information.

| id | before | after | diff |
| --- | --- | --- | --- |
| data | 2622 | 2619 | -3 |

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

| id | before | after | diff |
| --- | --- | --- | --- |
| esql | 821.2KB | 821.2KB | +10.0B |
| streamsApp | 1.9MB | 1.9MB | +10.0B |
| total | | | +20.0B |

Public APIs missing exports

Total count of every type that is part of your API that should be exported but is not. This will cause broken links in the API documentation system. Target amount is 0. Run `node scripts/build_api_docs --plugin [yourplugin] --stats exports` for more detailed information.

| id | before | after | diff |
| --- | --- | --- | --- |
| data | 32 | 31 | -1 |

Page load bundle

Size of the bundles that are downloaded on every page load. Target size is below 100kb

| id | before | after | diff |
| --- | --- | --- | --- |
| data | 441.9KB | 442.8KB | +888.0B |

Unknown metric groups

API count

| id | before | after | diff |
| --- | --- | --- | --- |
| data | 3245 | 3243 | -2 |

History

cc @drewdaemon

Contributor

@pmuellr pmuellr left a comment


ResponseOps changes LGTM

@drewdaemon drewdaemon merged commit 9b6b455 into elastic:main Apr 10, 2026
17 checks passed

Labels

  • backport:skip (This PR does not require backporting)
  • Feature:Search (Querying infrastructure in Kibana)
  • release_note:enhancement
  • v9.4.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

  • [data.search] Use continuous-polling to monitor async search progress
  • [data.search] allow polling status to be streamed