[performance] process-wide cache for advanced settings lookup #262618

Merged
stratoula merged 43 commits into elastic:main from drewdaemon:advanced-settings-per-setting-cache
Apr 22, 2026

Conversation

@drewdaemon (Contributor) commented Apr 10, 2026

Summary

This is a performance enhancement to the UI Settings service. It introduces a space-aware, process-wide cache layer for the settings saved object. It applies to `.get('setting-name')` lookups on all UI settings clients.

Justification

When a dashboard loads, search-related settings need to be consulted before each search request. That can mean 20+ saved object lookups (see #84923), each happening in a different request, so the [per-request cache](#84513) doesn't cover it.

Caching approach

`NamespacedCache` is the class that handles caching. It stores both resolved settings objects and in-flight requests to deduplicate read operations (`getUserDefined`).

The server-side cache is freshened in three scenarios:

  • A setting is changed (`setMany`)
  • A browser requests `index.html`
  • A settings lookup occurs after the TTL has expired

The browser UI settings client is guaranteed to get fresh settings on page loads (this preserves the existing UX for changing settings).

In a multi-node scenario, the node performing a setting change gets fast consistency. Other nodes get eventual consistency.
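
A minimal sketch of the mechanism, for illustration only: the names `NamespacedCache`, `entries`, `inflightReads`, `NamespacedCacheEntry`, and the TTL constant appear in the diff excerpts further down, but the `Fetcher` type, constructor shape, and method bodies here are assumptions, not the actual Kibana implementation.

```ts
type Fetcher<T> = (namespace: string) => Promise<T>;

interface NamespacedCacheEntry<T> {
  value: T;
  timer: NodeJS.Timeout;
}

export class NamespacedCache<T = unknown> {
  private readonly entries = new Map<string, NamespacedCacheEntry<T>>();
  private readonly inflightReads = new Map<string, Promise<T>>();

  constructor(
    private readonly fetch: Fetcher<T>, // e.g. a saved-object read per space
    private readonly ttlMs: number = 10_000
  ) {}

  async get(namespace: string): Promise<T> {
    // Fresh enough: TTL eviction hasn't fired yet.
    const cached = this.entries.get(namespace);
    if (cached) return cached.value;

    // Deduplicate concurrent reads: later callers await the in-flight fetch.
    const pending = this.inflightReads.get(namespace);
    if (pending) return pending;

    const read = this.fetch(namespace)
      .then((value) => {
        this.store(namespace, value);
        return value;
      })
      .finally(() => this.inflightReads.delete(namespace));
    this.inflightReads.set(namespace, read);
    return read;
  }

  // Called when a setting changes (setMany) or a browser requests index.html,
  // so the next read hits the saved object instead of the cache.
  invalidate(namespace: string): void {
    const entry = this.entries.get(namespace);
    if (entry) {
      clearTimeout(entry.timer);
      this.entries.delete(namespace);
    }
  }

  private store(namespace: string, value: T): void {
    this.invalidate(namespace);
    const timer = setTimeout(() => this.entries.delete(namespace), this.ttlMs);
    timer.unref(); // don't keep the Node.js process alive just for eviction
    this.entries.set(namespace, { value, timer });
  }
}
```

In this model, the node that performs `setMany` invalidates immediately (the fast consistency mentioned above); other nodes converge via the TTL or the next page load.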

Checklist

- [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios

(Outdated comment thread on src/core/packages/ui-settings/server-internal/src/per_setting_cache.ts)
@drewdaemon drewdaemon changed the title from "[PoC] Advanced settings per setting cache" to "[PoC] per setting cache for advanced settings" Apr 10, 2026
@drewdaemon drewdaemon changed the title from "[PoC] per setting cache for advanced settings" to "[PoC] per-setting cache for advanced settings lookup" Apr 10, 2026
@Dosant Dosant self-requested a review April 12, 2026 08:19
(Outdated comment thread on src/core/packages/ui-settings/server-internal/src/namespaced_cache.ts)
@kibanamachine (Contributor)

Flaky Test Runner Stats

🟠 Some tests failed. - kibana-flaky-test-suite-runner#11609

[❌] src/platform/test/functional/apps/context/config.ts: 2/5 tests passed.

see run history

@drewdaemon drewdaemon force-pushed the advanced-settings-per-setting-cache branch from e2c4760 to 48b9319 on April 14, 2026 at 21:46
@kibanamachine (Contributor)

Flaky Test Runner Stats

🎉 All tests passed! - kibana-flaky-test-suite-runner#11610

[✅] src/platform/test/functional/apps/context/config.ts: 10/10 tests passed.

see run history

@elastic elastic deleted a comment from elasticmachine Apr 14, 2026
@drewdaemon drewdaemon requested a review from a team as a code owner April 16, 2026 19:52
@drewdaemon drewdaemon requested a review from Dosant April 17, 2026 02:16
@kertal (Member) left a comment

🏆 For this PR

Doing visual performance testing locally with default settings, there's not much difference between main and this PR; but when simulating a slower network connection between ES and Kibana, the win is clearly visible. To slow down the connection, I applied the following on my local system:

# Enable pf
sudo pfctl -E

# Configure the dummynet pipe with 100ms delay
sudo dnctl pipe 1 config delay 100ms

# Apply both inbound and outbound rules in one shot
sudo pfctl -f - <<'EOF'
dummynet out proto tcp from any to 127.0.0.1 port 9200 pipe 1
dummynet in proto tcp from any port 9200 to any pipe 1
EOF
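
(To undo the throttling afterward, something like `sudo dnctl -q flush` to clear the dummynet pipe and `sudo pfctl -f /etc/pf.conf` to restore the default ruleset should work on a stock macOS setup; the original comment didn't include cleanup steps.)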

Loading our e-commerce dashboard in this case looked like this:

[image: side-by-side dashboard load race, with cache (left) vs. without (right)]

Median improvement: 0.706s; average: 1.004s over 10 runs. Left side with cache, right side without.

More details

race_with-cache_vs_without-cache.html

return result;
}

private async computeUserProvided<T = unknown>(bypassCache = false): Promise<UserProvided<T>> {
Contributor

looks like `bypassCache` is not used?

{ handleWriteErrors }: { validateKeys?: boolean; handleWriteErrors?: boolean } = {}
) {
this.cache.del();
// check if a read is currently in progress, wait for it to complete before proceeding
Contributor

Seems like an outdated comment.


this.onWriteHook(changes);

// Register write operation as in-flight so concurrent reads will wait
Contributor

Seems like an outdated comment


timer: NodeJS.Timeout;
}

export const NAMESPACED_CACHE_TTL = 60_000;
Contributor

I think we discussed lowering this to reduce the eventual consistency time in a multi-node setup?

Contributor Author

Yes, totally spaced 👍

Contributor Author

What value do you think is reasonable given the cache-bust-on-page-load addition?

Contributor

5 seconds to be on the safe side? I think this is what we had in the per-request cache?

Contributor Author

I think 5 seconds made sense for a request cache because few requests take longer than 5 seconds.

On a busy cluster the cache will often be refreshed via page loads, so we're really talking about the absolute outer bound on eventual consistency. And from a dashboard perspective, users often change a filter, etc. at greater than 5-second intervals, so they'd frequently be waiting on a refresh if the TTL were that low.

That, together with the fact that changing these settings is an edge case, makes me lean more aggressive, e.g. 10-15 seconds.

But even with 5 seconds this is a significant improvement over the current situation. I defer to you, of course!

Contributor

60 is definitely very aggressive. I would be open to 10s though; I think it will work nicely. But @Dosant, if you prefer 5s that is OK by me. Let me know and I will make the change (Drew is out and he told me to take care of this PR).

Contributor

thank you both! let's do 10! 👍

Contributor

Done here 8697441

Contributor Author

(Agreed that 60 is too aggressive... I put it there before I considered multi-node scenarios)
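
For reference, the change in 8697441 is presumably just the constant (inferred from this thread, not checked against the actual diff):

```ts
export const NAMESPACED_CACHE_TTL = 10_000;
```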

@Dosant (Contributor) left a comment

LGTM! thanks for getting this over the line

export class NamespacedCache<T = unknown> {
private readonly entries = new Map<string, NamespacedCacheEntry<T>>();
private readonly inflightReads = new Map<string, Promise<T>>();
private readonly inflightWrites = new Map<string, Promise<void>>();
Contributor

I think `inflightWrites` is not used and can be cleaned up.

Contributor

You are absolutely right 3d7c96e
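
Presumably the cleanup leaves only the two maps that are actually used (a guess from this thread, not the real diff):

```ts
interface NamespacedCacheEntry<T> {
  value: T;
  timer: NodeJS.Timeout;
}

export class NamespacedCache<T = unknown> {
  private readonly entries = new Map<string, NamespacedCacheEntry<T>>();
  private readonly inflightReads = new Map<string, Promise<T>>();
  // inflightWrites (a Map<string, Promise<void>>) was unused and is dropped
}
```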


@elasticmachine (Contributor)

💛 Build succeeded, but was flaky

Failed CI Steps

Test Failures

  • [job] [logs] FTR Configs #66 / EPM Endpoints installs and uninstalls all assets reinstalls all assets 0.2.0 should have created the correct saved object

Metrics [docs]

Public APIs missing comments

Total count of every public API that lacks a comment. Target amount is 0. Run `node scripts/build_api_docs --plugin [yourplugin] --stats comments` for more detailed information.

| id | before | after | diff |
| --- | --- | --- | --- |
| @kbn/core-ui-settings-server | 24 | 25 | +1 |

Unknown metric groups

API count

| id | before | after | diff |
| --- | --- | --- | --- |
| @kbn/core-ui-settings-server | 42 | 43 | +1 |


@walterra (Contributor) left a comment

Code review only for datavis codeownership, LGTM.

@stratoula stratoula merged commit 9a59de9 into elastic:main Apr 22, 2026
18 checks passed
mbondyra added a commit to mbondyra/kibana that referenced this pull request Apr 22, 2026
…sationChanges23

* commit '9a7b717c662d1c904052bc59f0e5a81daab87c7f': (145 commits)
  Upgrade EUI to v114.2.0 (elastic#264550)
  [Entity Analytics] Add missing OpenAPI descriptions and examples to p… (elastic#264778)
  [Entity Resolution] Clarify CSV upload result for already-linked entities (elastic#264689)
  [AI Infra] Fix failing GenAI Settings Scout tests (elastic#260496)
  [Agent Builder] [Bug Bash] OAuth connector settings mention fields that are not there (elastic#264756)
  [performance] process-wide cache for advanced settings lookup (elastic#262618)
  [CI] Update limits.yml for securitySolution (elastic#264946)
  [SLO] Fix APM embeddable ids (elastic#264750)
  [EDR Workflows] Unify artifacts empty state buttons (elastic#264389)
  [Alert Triage workflow] Adds security.buildAlertEntityGraph and security.renderAlertNarrative… (elastic#259159)
  [SigEvents] Add KI feature identification endpoints and refactor task to use shared service (elastic#263528)
  [Scout] Migrate Data Views API tests from FTR - Part5 (elastic#264088)
  [Cases] Apply shared extended_fields path util server side (elastic#264706)
  [Lens as code] Fix metric trendline (elastic#264777)
  [api-docs] 2026-04-22 Daily api_docs build (elastic#264882)
  [Scout] Update test config manifests (elastic#264575)
  [workflows_management] Lazy-load Zod connector schemas to cut idle memory (elastic#264283)
  [ES|QL] Fix ES|QL columns reset race during active fetch (elastic#263947)
  [Content List] Column layout props, sticky actions, and title click handlers (elastic#264203)
  [Lens as code] Validate `id` in route for new vis types (elastic#264480)
  ...
SoniaSanzV pushed a commit to SoniaSanzV/kibana that referenced this pull request Apr 27, 2026
awahab07 added a commit that referenced this pull request Apr 27, 2026

This PR:
- stabilizes the suite by reducing advanced-setting flips.
- groups disabled and enabled scenarios so the suite only transitions
into the enabled state once.
- waits for the enabled setting to propagate before running the
enabled-path assertions (following
[ref](#262618 (comment))).

Labels

• backport:skip (This PR does not require backporting)
• release_note:skip (Skip the PR/issue when compiling release notes)
• v9.5.0
