
[workflows_management] Lazy-load Zod connector schemas to cut idle memory#264283

Merged
Kiryous merged 3 commits into
elastic:mainfrom
talboren:zod-lazy-workflows-management
Apr 22, 2026


@talboren talboren commented Apr 19, 2026

Summary

Contributes to the 9.4.0 OOM effort for 1GB ECH/ECK deployments (parent epic: #264170, this sub-issue: #264175).

Heap snapshots traced ~16MB of retained Zod-schema heap to @kbn/workflows-management-plugin, originating from common/connector_action_schema.ts. At module-load time the plugin:

  1. Eagerly imports every ./stack_connectors_schema/* submodule.
  2. Eagerly imports the ~21MB @kbn/connector-specs package.
  3. Immediately constructs 5 connector-schema Maps and a staticConnectors array, each populated with fully-instantiated Zod schemas — whether or not any workflow code ever runs.

On a 1GB Kibana pod this contributed directly to idle-memory OOM kills after the Zod v4 upgrade.

What this PR does

  • Converts the 5 Maps and staticConnectors into cached getter functions:
    • getConnectorSpecsInputSchemas()
    • getConnectorInputSchemas()
    • getConnectorActionInputSchemas()
    • getConnectorOutputSchemas()
    • getConnectorActionOutputSchemas()
    • getStaticConnectors()
  • Moves the heavy `require()`s for `./stack_connectors_schema`, `@kbn/connector-specs`, and `FetcherConfigSchema`/`KibanaStepMetaSchema` into arrow-function loaders that run only on first getter call. The arrow-function form sidesteps `@typescript-eslint/no-var-requires` without eslint-disables (same pattern used in `@kbn/workflows/spec/kibana/index.ts`).
  • Updates the sole consumer, `common/schema.ts`, to use the getters.
  • Removes the unused `WORKFLOW_ZOD_SCHEMA` / `WORKFLOW_ZOD_SCHEMA_LOOSE` eager exports (the file already had a `TODO` flagging them as dead).
  • Adds a test utility `__resetConnectorSchemaCachesForTesting()`.
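The getter-plus-loader shape described in the list above can be sketched roughly as follows. This is a minimal illustration rather than the PR's actual code: `HeavySchemas`, `loadHeavySchemas`, `getExampleInputSchemas`, `resetCachesForTesting`, and the Map contents are all hypothetical stand-ins, and in the real file the loader body would be a `require()` of the schema modules.

```typescript
// Hypothetical stand-in for what the deferred require() would return.
interface HeavySchemas {
  inputSchemas: Map<string, string>;
}

let loaderCalls = 0; // instrumentation for this sketch only

// Arrow-function loader: nothing heavy is evaluated until it is invoked.
const loadHeavySchemas = (): HeavySchemas => {
  loaderCalls += 1;
  return { inputSchemas: new Map([['.example-connector', 'schema placeholder']]) };
};

let cachedInputSchemas: Map<string, string> | undefined;

// Cached getter: the first call pays the load cost, later calls reuse the Map.
function getExampleInputSchemas(): Map<string, string> {
  if (cachedInputSchemas === undefined) {
    cachedInputSchemas = loadHeavySchemas().inputSchemas;
  }
  return cachedInputSchemas;
}

// Test-only reset, playing the role of __resetConnectorSchemaCachesForTesting().
function resetCachesForTesting(): void {
  cachedInputSchemas = undefined;
}

// Exposes the counter so callers can check how often the loader ran.
function getLoaderCalls(): number {
  return loaderCalls;
}
```

The point of the shape is that importing the file only defines functions; the expensive work is tied to the first getter call.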

Tests

  • Updated `connector_action_schema.test.ts` to use the getter API + added caching assertions.
  • New `connector_action_schema.lazy.test.ts` is a regression guard that inspects `require.cache` to assert:
    • Importing `connector_action_schema` does not load any `stack_connectors_schema/*` or `@kbn/connector-specs` module.
    • Importing `schema.ts` (the transitive consumer) also does not.
    • The stack-connector modules only appear in the cache after `getConnectorInputSchemas()` runs.
    • `@kbn/connector-specs` only appears after `getConnectorSpecsInputSchemas()` runs.
    • Getters return cached instances on repeat calls.
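The `require.cache` check behind these assertions could look roughly like the helper below. It is a sketch under assumptions: `ModuleCache`, `findLoadedModules`, and `HEAVY_FRAGMENTS` are hypothetical names, and the path fragments are guesses at how the resolved file paths would look, not values taken from the actual test.

```typescript
// Minimal shape of a CommonJS module cache: keys are resolved file paths.
type ModuleCache = Record<string, unknown>;

// Returns every cached file path that contains one of the given fragments.
function findLoadedModules(cache: ModuleCache, fragments: string[]): string[] {
  return Object.keys(cache).filter((file) =>
    fragments.some((fragment) => file.includes(fragment))
  );
}

// Assumed fragments for the two heavy dependencies named in the description.
const HEAVY_FRAGMENTS = ['stack_connectors_schema', 'connector-specs'];
```

In a test, one would pass `require.cache` as the first argument right after importing the module under test and expect an empty array, then call a getter and expect the array to become non-empty.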

Memory impact

Expected reduction on an idle Kibana where no workflow code path is exercised:

Kibana instances that actively use workflows pay the same ~16MB on first use — trade-off is acceptable and only hits once per process.

Risk / scope

  • No external API change. The Maps/`staticConnectors` were never re-exported from `common/index.ts`; grep confirms no consumer outside this package referenced them.
  • Server-side consumers (`WorkflowsService`, agent-builder tools) go through `common/schema.ts` which is fully updated.
  • Public-side consumers (react components under `public/`) also go through the same `schema.ts` getters.

Not in scope (follow-ups)

  • `@kbn/connector-specs` internal lazy-loading — owned by response-ops in Zod lazy schemas @kbn/connector-specs #264180. This PR already defers loading it at startup, but the package itself is still expensive to evaluate on first getter call.
  • Agent-builder attachments (`workflow_edit_tools` / `workflow_yaml_*`) — ~1MB, separate import chain, can be a follow-up.

Validation

  • `node scripts/check`: ✓ lint (4 files), ✓ jest (4244 tests), ✓ tsc (1 project)
  • `node scripts/jest src/platform/plugins/shared/workflows_management/common/connector_action_schema.lazy.test.ts`: all 5 tests pass
  • `node scripts/type_check --project src/platform/plugins/shared/workflows_management/tsconfig.json`: clean

Closes #264175

For reviewers

  • `connector_action_schema.ts` is large-diff but small-change: the body of each Map/array now lives inside a getter. `git diff -b` may be easier to read.
  • The `require()`-inside-arrow-function pattern is deliberate; `const x = require(...)` would trigger `@typescript-eslint/no-var-requires`.

Checklist

  • Added unit tests for the bug fix (both API-level and lazy-loading regression).
  • No API changes, so no docs update needed.
  • `node scripts/check` clean.
  • Verified no meaningful perf regression on first workflow edit/execute after startup.

Made with Cursor

@talboren talboren added the backport, v9.4.0, release_note:skip, and Team:One Workflow labels Apr 19, 2026
@Kiryous Kiryous self-requested a review April 20, 2026 10:06

@semd semd left a comment


Nice work 💯
The heap analysis and the targeted fix look correct to me.

Before merging, I'd like to suggest an alternative shape that I think gets the same memory win with substantially less surface area. Curious what you think.

The observation

The six lazy getters in connector_action_schema.ts always end up firing together: opening the YAML editor calls getWorkflowZodSchema() → getAllConnectorsInternal() → all six getters on the same call stack. Monaco needs the full union schema for autocomplete, so per-Map (and per-connector) deferral collapses to a single effective deferral in practice. That means we're paying a fair amount of refactoring cost (six getters, six caches, a test-only reset, three require()-in-arrow workarounds, ~500 lines of churn in connector_action_schema.ts) to defer six things that always defer together.

Proposed alternative: single boundary in schema.ts

schema.ts is the only consumer of connector_action_schema.ts, so it's the natural single point of control. We already have memoize-one available, which is a clean fit here:

// connector_action_schema.ts — REVERT to the original eager Map exports.
// No getters, no cached vars, no __resetForTesting.
// Add a warning comment not to import statically from this file, just in case.
export const ConnectorInputSchemas = new Map([...]);
export const ConnectorActionInputSchemas = new Map([...]);
// ...etc

// schema.ts: the only consumer becomes the single lazy boundary.
import memoizeOne from 'memoize-one';

const getConnectorSchemas = memoizeOne(
  // eslint-disable-next-line @typescript-eslint/no-var-requires -- defers ~16MB zod heap, see #264175
  (): typeof import('./connector_action_schema') => require('./connector_action_schema')
);

function getSubActionParamsSchema(actionTypeId: string, subActionName: string) {
  const { ConnectorInputSchemas, ConnectorActionInputSchemas, ConnectorSpecsInputSchemas } =
    getConnectorSchemas();
  // ...original lookup logic, unchanged
}

Comparison

| | Current PR | Single-boundary |
|---|---|---|
| connector_action_schema.ts diff | +500 / −500 | 0 |
| Lazy boundaries | 6 | 1 |
| require() workarounds | 3 (arrow-fn trick) | 1 (explicit, justified disable) |
| Test-only API surface | __resetConnectorSchemaCachesForTesting | none (jest.resetModules()) |
| Idle-startup memory savings | ~16 MB | ~16 MB (same) |

Same memory outcome, much less code, no API change to connector_action_schema.ts, and the lazy boundary lives in the file that actually consumes it.

Caveat to verify

The PR description says common/index.ts doesn't re-export the Map constants and no consumer outside schema.ts references them. If that still holds on the latest base, the single-boundary version is strictly better. The lazy regression test added in this PR is still valuable; just point it at the schema.ts boundary instead of the six getters.

Happy to be wrong about any of this, wanted to suggest it before the structural choice is made.


Kiryous commented Apr 20, 2026

In general, I agree with @semd's suggestion.

A couple of other observations/questions:

  1. Quick bench on my machine: first getConnectorSpecsInputSchemas() is ~565ms sync, first getConnectorInputSchemas() ~180ms. We've moved the cost from startup to the first edit/validate, which is what we wanted, but it's a noticeable stall on the first workflow interaction after a cold start. Can we follow up by warming it from an idle tick in setup(), or switching to a dynamic import() so it's async?

  2. The CI bot flagged +33.1KB on workflowsManagement. Does this still reproduce if we go with @semd's single-boundary approach? That one doesn't touch connector_action_schema.ts at all, so I'd expect the bundle diff to go back to neutral — worth confirming before we merge a memory-saving refactor that grows the bundle.

  3. Leaving this thread open until we have an RSS snapshot from an actual Kibana process (not just the jest harness) confirming the ~16MB is deferred end-to-end. Happy to script the before/after if it helps.
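The follow-up floated in point 1 (an async load plus a background warm-up) might be sketched like this. It is only an illustration under assumptions: `ConnectorSchemas`, `loadSchemas`, `getConnectorSchemasAsync`, and `warmConnectorSchemas` are hypothetical names, the loader stands in for a dynamic `import()` of the schema module, and `setTimeout` is used as a simple substitute for whatever idle scheduling the plugin would really use.

```typescript
// Hypothetical stand-in for the connector schema module's exports.
interface ConnectorSchemas {
  names: string[];
}

let loaderCalls = 0; // instrumentation for this sketch only

// Stand-in for `() => import('./connector_action_schema')`.
const loadSchemas = async (): Promise<ConnectorSchemas> => {
  loaderCalls += 1;
  return { names: ['.example-connector'] };
};

let pending: Promise<ConnectorSchemas> | undefined;

// Memoizes the promise itself, so concurrent callers share a single load.
function getConnectorSchemasAsync(): Promise<ConnectorSchemas> {
  if (pending === undefined) {
    pending = loadSchemas();
  }
  return pending;
}

// Schedules a background warm-up; a failed warm-up clears the cache so a
// later call can retry instead of reusing a rejected promise.
function warmConnectorSchemas(delayMs = 0): void {
  setTimeout(() => {
    getConnectorSchemasAsync().catch(() => {
      pending = undefined;
    });
  }, delayMs);
}

// Exposes the counter so callers can check how often the loader ran.
function getLoaderCalls(): number {
  return loaderCalls;
}
```

Note that an async boundary only moves where the cost is paid: evaluating the schema module still runs on the main thread, so the warm-up mainly helps by doing that work before the first user interaction.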

@Kiryous Kiryous self-assigned this Apr 21, 2026

Kiryous commented Apr 21, 2026

Heap snapshot analysis (built Kibana, allocation tracking)

Compared against Rudolf's baseline (main @ eba1d16, 822.8 MB idle heap).
Branch snapshot: built Kibana with HEAP_TRACK_FORCE=1, idle, no UI opened.

Total idle heap: 822.8 → 811.7 MB (-11.1 MB)

Allocated by Plugin (allocation site) — before vs after

| Plugin | Before (MB) | After (MB) | Delta |
|---|---|---|---|
| @kbn/security-solution-plugin | 87.7 | 89.3 | +1.6 |
| @kbn/stack-connectors-plugin | 46.9 | 46.8 | -0.1 |
| @kbn/alerting-plugin | 37.0 | 37.1 | +0.1 |
| @kbn/actions-plugin | 31.3 | 31.3 | 0.0 |
| @kbn/fleet-plugin | 27.0 | 30.0 | +3.0 |
| @kbn/streams-plugin | 25.5 | 25.8 | +0.3 |
| @kbn/cases-plugin | 23.6 | 23.6 | 0.0 |
| @kbn/workflows-management-plugin | 17.2 | 4.5 | -12.7 |
| @kbn/agent-builder-platform-plugin | 15.1 | 15.1 | 0.0 |
| @kbn/entity-store | 14.8 | 14.5 | -0.3 |
| @kbn/lists-plugin | 11.9 | 11.9 | 0.0 |
| @kbn/apm-plugin | 11.0 | 11.2 | +0.2 |
| @kbn/monitoring-plugin | 9.4 | 9.4 | 0.0 |
| @kbn/alerting-v2-plugin | 9.1 | 9.2 | +0.1 |
| @kbn/elastic-assistant-plugin | 8.9 | 9.2 | +0.3 |
| @kbn/agent-builder-plugin | 7.6 | 7.8 | +0.2 |
| (remaining plugins) | | | ±0.5 noise |
| (no plugin frame) | 236.4 | 241.6 | +5.2 |
| (untracked) | 33.5 | 23.7 | -9.8 |

zod retained dropped by a matching 12.6 MB (236.8 → 224.2). No other plugins regressed beyond noise.

Remaining 4.5 MB breakdown

Dug into what's still allocated eagerly on the server path:

  1. @kbn/workflows barrel (~3-4 MB): common/schema.ts value-imports builtInStepDefinitions, generateYamlSchemaFromConnectors, SystemConnectorsMap etc. from @kbn/workflows. That loads spec/schema.ts, which has a bunch of top-level Zod schemas (DurationSchema, BaseStepSchema, WorkflowSettingsSchema…). The same barrel is also pulled in by workflows_management_service.ts and workflows_management_api.ts. This is the dominant remaining cost.

  2. Small eager Zod in the plugin (~0.3-0.5 MB): server/connectors/workflows/schema.ts (ExecutorParamsSchema) and common/lib/import/index.ts (WorkflowExportManifestSchema etc.) define Zod schemas at module scope. Negligible individually.

  3. Zod runtime (~0.5 MB) — the plugin loads both @kbn/zod and @kbn/zod/v4 through different paths, each carrying its own runtime.

Lazy-loading @kbn/workflows would be the next meaningful win, but that's a deeper change since schema.ts actively uses functions from it — probably a separate follow-up.

Kiryous added 2 commits April 21, 2026 19:40
…mory

Single lazy boundary in schema.ts (the sole consumer of
connector_action_schema.ts) defers ~16 MB of zod-schema heap until the
first workflow edit/execute call. connector_action_schema.ts is left
untouched — no getter wrappers, no test-only reset API.

Removes the unused WORKFLOW_ZOD_SCHEMA / WORKFLOW_ZOD_SCHEMA_LOOSE
module-level constants whose eager generateYamlSchemaFromConnectors()
calls contributed to the startup heap.

Heap snapshot (built Kibana, allocation tracking, idle):
  @kbn/workflows-management-plugin alloc site: 17.2 → 4.2 MB (-13 MB)

Closes elastic#264175

Made-with: Cursor
@Kiryous Kiryous force-pushed the zod-lazy-workflows-management branch from 0e55f4b to 829f896 Compare April 21, 2026 16:24
@Kiryous Kiryous marked this pull request as ready for review April 21, 2026 16:59
@Kiryous Kiryous requested a review from a team as a code owner April 21, 2026 16:59

@semd semd left a comment


LGTM

@Kiryous Kiryous added the backport:version label and removed the backport label Apr 22, 2026
@elasticmachine

💛 Build succeeded, but was flaky

Failed CI Steps

Test Failures

  • [job] [logs] affected Scout: [ platform / streams_app-stateful-classic ] plugin / local-stateful-classic - Stream data retention - inheritance - should toggle inherit mode on and off

Metrics [docs]

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

| id | before | after | diff |
|---|---|---|---|
| workflowsManagement | 2.3MB | 2.3MB | +539.0B |

Unknown metric groups

ESLint disabled line counts

| id | before | after | diff |
|---|---|---|---|
| workflowsManagement | 146 | 147 | +1 |

Total ESLint disabled count

| id | before | after | diff |
|---|---|---|---|
| workflowsManagement | 175 | 176 | +1 |

History

cc @Kiryous

@Kiryous Kiryous merged commit bf4e1b0 into elastic:main Apr 22, 2026
20 checks passed
@kibanamachine

Starting backport for target branches: 9.4

https://github.com/elastic/kibana/actions/runs/24761518230

@kibanamachine

💚 All backports created successfully

Status Branch Result
9.4

Note: Successful backport PRs will be merged automatically after passing CI.

Questions ?

Please refer to the Backport tool documentation

kibanamachine added a commit that referenced this pull request Apr 22, 2026
…dle memory (#264283) (#264885)

# Backport

This will backport the following commits from `main` to `9.4`:
- [[workflows_management] Lazy-load Zod connector schemas to cut idle
memory (#264283)](#264283)

<!--- Backport version: 9.6.6 -->

### Questions ?
Please refer to the [Backport tool
documentation](https://github.com/sorenlouv/backport)


Co-authored-by: Tal <tal.borenstein@elastic.co>

Labels

backport:version, release_note:skip, Team:One Workflow, v9.4.0, v9.5.0


Development

Successfully merging this pull request may close these issues.

Zod lazy schemas @kbn/workflows-management-plugin

5 participants