
[Streams] Add failure store as a data source for simulations #249559

Merged

CoenWarmer merged 20 commits into elastic:main from CoenWarmer:streams-add-failure-store-as-source on Jan 23, 2026

Conversation

@CoenWarmer (Contributor) commented Jan 19, 2026

Resolves https://github.com/elastic/streams-program/issues/268

Summary

This adds the Failure Store as an option when running simulations.

Screen.Recording.2026-01-19.at.14.43.56.mov

Details

  • Adds an internal endpoint, GET /internal/streams/{name}/processing/_failure_store_samples, which returns samples from a stream's failure store with the processors of all ancestor streams applied.
  • Adds failure-store as a DataSourceActor in XState

@CoenWarmer CoenWarmer requested review from a team as code owners January 19, 2026 13:36
@CoenWarmer CoenWarmer changed the title from "Add failure store as a data source for simulations" to "[Streams] Add failure store as a data source for simulations" Jan 19, 2026
@CoenWarmer CoenWarmer added the release_note:skip (Skip the PR/issue when compiling release notes), backport:skip (This PR does not require backporting), and ci:beta-faster-pr-build labels Jan 19, 2026
@tonyghiani tonyghiani self-requested a review January 19, 2026 15:46
@tonyghiani (Contributor) left a comment

I didn't check all the functionality and the code yet, but something we should take care of for this feature is privileges and feature enablement.

If the failure store is not enabled or the user doesn't have privileges to access it, we shouldn't show the data source and the add action.

@CoenWarmer CoenWarmer requested a review from tonyghiani January 19, 2026 19:44
@CoenWarmer (Contributor, Author)

> If the failure store is not enabled or the user doesn't have privileges to access it, we shouldn't show the data source and the add action.

Good point. Updated.

@tonyghiani (Contributor) left a comment

Overall good job, I left some notes on changes we'd need to make.

Something I noticed playing with it on the UI is that we are probably doing some unnecessary steps for using this data source. The current approach treats the failure store data source like a kql or custom samples one, which are configurable data sources and can be spawned multiple times.

WRT the failure store, it's more similar to the latest samples source.
I think we could simplify the UI and the discoverability of this data source if we simply keep it available in the dropdown and flyout, as we do for the latest samples, having it available only for users with the right privileges and when the failure store is enabled.

@patpascal @LucaWintergerst wdyt?

```typescript
/**
 * Runs the ingest pipeline simulation with the given processors on the documents.
 */
async function simulateWithProcessors({
```

question: there is already the simulateProcessing handler that deals with a lot of edge cases for some processor types, such as the manual ingest pipeline, etc. This is probably a good place to re-use it, as we expect all those edge cases to be handled correctly.

```typescript
}: FailureStoreSamplesDeps): Promise<FailureStoreSamplesResponse> => {
  const { name } = params.path;
  const size = params.query?.size ?? DEFAULT_SAMPLE_SIZE;
  const start = params.query?.start;
```

For failure store samples, we should query the failure store for documents that failed after the last update to the stream's processing, as the user might have already made a change that fixes what was breaking those documents during processing.

For this, you can use the stream.ingest.processing.updated_at property.
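A minimal sketch of the suggested filter (not the PR's actual code; `buildFailureStoreQuery` is a hypothetical helper, and the `@timestamp` field is assumed to mark when the failure was recorded):

```typescript
// Sketch: build an Elasticsearch-style query that only matches failure-store
// documents newer than the stream's last processing update. `updatedAt` would
// come from the stream definition's `stream.ingest.processing.updated_at`.
interface FailureStoreQuery {
  bool: { filter: Array<Record<string, unknown>> };
}

function buildFailureStoreQuery(updatedAt?: string): FailureStoreQuery {
  const filter: Array<Record<string, unknown>> = [];
  if (updatedAt) {
    // Only failures that happened after the last processing change.
    filter.push({ range: { '@timestamp': { gte: updatedAt } } });
  }
  return { bool: { filter } };
}
```

Without an `updatedAt`, the query degrades to matching all failure-store documents, which is the behavior the thread eventually settles on below.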

@flash1293 (Contributor) commented Jan 21, 2026

High level comments, I just checked out the recording in the description:

  • The samples should "unwrap" the error message and related metadata; that isn't what we want to simulate. We just want the original document at the root, not nested in a field called document. Otherwise you can't fix your processing for newly incoming docs based on it.
  • Should we have some additional affordance if there are (recent) docs in the failure store to pull them in? Right now it's really well hidden
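A minimal sketch of the unwrapping, assuming the standard failure-store document layout where the original payload sits under `document.source` next to `error` metadata (`unwrapFailureStoreHit` and the exact shapes here are hypothetical, not the PR's code):

```typescript
// Failure-store documents wrap the original payload; the simulation should run
// on what was actually ingested, so we pull `document.source` back to the root.
interface FailureStoreHit {
  '@timestamp'?: string;
  document?: { index?: string; source?: Record<string, unknown> };
  error?: { type?: string; message?: string };
  [key: string]: unknown;
}

function unwrapFailureStoreHit(hit: FailureStoreHit): Record<string, unknown> {
  // Fall back to the hit itself if it doesn't have the expected shape.
  return hit.document?.source ?? (hit as Record<string, unknown>);
}

const sample = unwrapFailureStoreHit({
  '@timestamp': '2026-01-19T14:43:56.000Z',
  document: { index: 'logs-foo', source: { message: 'boom' } },
  error: { type: 'illegal_argument_exception', message: 'failed to parse' },
});
// sample is the original ingested document: { message: 'boom' }
```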

@CoenWarmer (Contributor, Author)

@tonyghiani @flash1293

Thanks for the reviews! Just so I get this right: we want the failure store option to be added to the drop down that currently shows "Latest samples", instead of going through the steps of first adding it via the control that also allows adding custom KQL samples?

@flash1293 (Contributor)

> Thanks for the reviews! Just so I get this right: we want the failure store option to be added to the drop down that currently shows "Latest samples", instead of going through the steps of first adding it via the control that also allows adding custom KQL samples?

Not sure, that's one option but I'm not sure whether it's the right one. @patpascal what do you think?

@tonyghiani (Contributor)

> Thanks for the reviews! Just so I get this right: we want the failure store option to be added to the drop down that currently shows "Latest samples", instead of going through the steps of first adding it via the control that also allows adding custom KQL samples?

That's how I see it being the simplest possible (just there when available), but let's wait for Patri's feedback.

@patpascal (Contributor)

I think that if users have failure store enabled, showing it as a selectable option, without requiring them to go through the extra steps to add it, would make things easier for them. So I agree.

@CoenWarmer CoenWarmer requested a review from tonyghiani January 21, 2026 17:01
@CoenWarmer (Contributor, Author)

> The samples should "unwrap" the error message part and stuff, that isn't what we want to simulate, we just want the original document at the root, not in a field called document. Otherwise you can't fix your processing for newly incoming docs based on that

Addressed with 5b58dce

> Should we have some additional affordance if there are (recent) docs in the failure store to pull them in? Right now it's really well hidden

Addressed as well

@flash1293 (Contributor) left a comment

There is some kind of state issue:

  • Go to processing tab
  • Select failure store from the list
  • URL state is updated, but nothing changes
  • Refresh
  • It starts working

Works really well otherwise!

```typescript
  dataStream: dataStream as DataStreamWithFailureStore,
});
if (!isEnabledFailureStore(effectiveFailureStore)) {
  throw new SecurityError(`Failure store is not enabled for stream ${params.path.name}`);
```
@flash1293 (Contributor) commented Jan 22, 2026

Nit: That's not really a security error, right?

@CoenWarmer (Contributor, Author)

Fair point. I added a specific FailureStoreNotEnabledError error instead.
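A minimal sketch of what such a dedicated error type could look like (the class name follows the comment above; everything else, including the `assertFailureStoreEnabled` helper, is hypothetical, not the PR's code):

```typescript
// A specific error type makes the failure mode explicit instead of
// misreporting it as a security problem.
class FailureStoreNotEnabledError extends Error {
  constructor(message: string) {
    super(message);
    this.name = 'FailureStoreNotEnabledError';
  }
}

// Hypothetical guard mirroring the original throw site.
function assertFailureStoreEnabled(enabled: boolean, streamName: string): void {
  if (!enabled) {
    throw new FailureStoreNotEnabledError(
      `Failure store is not enabled for stream ${streamName}`
    );
  }
}
```

A route handler can then map this error type to a 4xx response rather than the 403 a security error would imply.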

```typescript
 * Fetches documents from the failure store and applies all configured processors
 * from parent streams to transform them.
 *
 * Only documents that failed after the most recent processing update are returned,
```

Not sold on this, this might be too clever - what if there was an unrelated processing change? This would render the data source useless, even if the user knows what's going on. Processing changes can be messy, what if the user makes a change, then reverts it? They wouldn't be able to get samples anymore.

I think it's OK to return all failure store docs, if processing configurations have been fixed since it's fine anyway because we simulate this updated processing so it won't lead to an error.

@flash1293 (Contributor) commented Jan 22, 2026

We could have it as an optional setting in the details (or maybe just a general KQL query in there?):

Screenshot 2026-01-22 at 09 50 44

But I don't think it's that important actually

@CoenWarmer (Contributor, Author)

@tonyghiani @flash1293

> For failure store samples, we should query the failure store for documents that failed after the last update on the stream processing, as the user might have already made a change that fixes how the failure store documents were breaking during processing.
>
> For this, you can use the stream.ingest.processing.updated_at property.

> Not sold on this, this might be too clever - what if there was an unrelated processing change? This would render the data source useless, even if the user knows what's going on. Processing changes can be messy, what if the user makes a change, then reverts it? They wouldn't be able to get samples anymore.
>
> I think it's OK to return all failure store docs, if processing configurations have been fixed since it's fine anyway because we simulate this updated processing so it won't lead to an error.

These two comments seem to conflict - one suggests filtering by updated_at, the other suggests returning all docs. Should we just return all docs? The simulation will show whether they're still failing with the current processing config, and filtering could hide useful historical failures if unrelated changes update the timestamp.

@flash1293 (Contributor)

They do, sorry for that. Happy to discuss @tonyghiani what do you think after what I wrote in my comment?

@tonyghiani (Contributor)

I see your point Joe, that's fair reasoning about the side effect of my suggestion. Still, in both cases, it would be weird if we kept getting documents that are not actually failing anymore.

There could be another option, but that would probably be a bit overkill:

  • Get the samples from failure store
  • Process them through the whole pipeline, including the current stream.
  • Keep only the documents that failed somewhere in the whole processing, meaning something in the pipeline is still broken just for them.
  • Re-process those samples with the parent processing only (as it works now) and return the final samples.

This would return only samples that don't currently go through the pipeline successfully, and exclude those already fixed by user changes. It might work well, although it complicates the API a bit. @flash1293 @CoenWarmer wdyt?
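The steps above could be sketched roughly like this (all names are hypothetical; `simulateFullPipeline` stands in for running a doc through the entire pipeline, including the current stream):

```typescript
// Rough sketch of the proposal: run each failure-store sample through the full
// pipeline and keep only the ones that still fail somewhere (steps 2 and 3).
type SimulateFn = (doc: Record<string, unknown>) => { failed: boolean };

function keepStillFailingDocs(
  docs: Array<Record<string, unknown>>,
  simulateFullPipeline: SimulateFn
): Array<Record<string, unknown>> {
  return docs.filter((doc) => simulateFullPipeline(doc).failed);
}

// Step 4 (re-processing the survivors with parent processing only, as the
// endpoint works now) would then run on the returned subset.
```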

@flash1293 (Contributor)

Can we have @LucaWintergerst make a quick call on this to not block the task?

@CoenWarmer (Contributor, Author)

> There is some kind of state issue:
>
>   • Go to processing tab
>   • Select failure store from the list
>   • URL state is updated, but nothing changes
>   • Refresh
>   • It starts working
>
> Works really well otherwise!

Fixed with 5e91b6a

@LucaWintergerst (Contributor)

I would suggest we go with all docs, and optionally offer the user an option in the settings where they can set a time picker for this source, like we do today for KQL. I find this more predictable.
If it's already fixed, great, then the user just sees that as part of the regular processing UI.
And if it's mixed, even better, then the user can make sure their changes don't negatively affect other recent failures.

@CoenWarmer (Contributor, Author) commented Jan 22, 2026

> I would suggest we go with all docs, and optionally offer the user an option in the settings where they can set a timepicker for this source, like we'd do it today for KQL. I find this to be more predictable, and if it's already fixed, great, then the user just sees that as part of the regular processing UI, and if it's mixed, even better, then the user can make sure that their changes don't negatively affect other recent failures

@LucaWintergerst @flash1293 @tonyghiani @patpascal

Like this?

Screen.Recording.2026-01-22.at.17.14.31.mov

@LucaWintergerst (Contributor)

Yes I think that's perfect

For all docs, are we getting the latest 100 or is it undefined? I think latest 100 would make a good UX by default

Same for the version with the time picker, is that sorted?

@CoenWarmer (Contributor, Author)

@LucaWintergerst

> For all docs, are we getting the latest 100 or is it undefined?

We're getting the latest 100.

> Same for the version with the time picker, is that sorted?

We're getting 100, sorted newest to oldest.
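A minimal sketch of the sampling behavior described here, assuming an Elasticsearch-style search body (the helper and everything beyond the `@timestamp` sort and the size of 100 are assumptions, not the PR's code):

```typescript
// Latest 100 failure-store docs, newest first; the optional time-picker range
// narrows the window without changing size or sort order.
const DEFAULT_SAMPLE_SIZE = 100;

function buildSamplesSearchBody(start?: string, end?: string) {
  const filter: Array<Record<string, unknown>> = [];
  if (start || end) {
    filter.push({
      range: {
        '@timestamp': {
          ...(start ? { gte: start } : {}),
          ...(end ? { lte: end } : {}),
        },
      },
    });
  }
  return {
    size: DEFAULT_SAMPLE_SIZE,
    sort: [{ '@timestamp': 'desc' }],
    query: { bool: { filter } },
  };
}
```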

@CoenWarmer CoenWarmer requested a review from flash1293 January 23, 2026 12:20
@tonyghiani (Contributor)

Looks pretty neat. The only thing I noticed is that the Failure Store source is deletable, while it shouldn't be, since it's managed by us, like the latest samples source:

Screenshot 2026-01-23 at 15 48 41

@CoenWarmer CoenWarmer enabled auto-merge (squash) January 23, 2026 15:27
@tonyghiani (Contributor) left a comment

Looks great! 👏

@CoenWarmer CoenWarmer merged commit df0d665 into elastic:main Jan 23, 2026
15 of 16 checks passed
@elasticmachine (Contributor)

💚 Build Succeeded

Metrics [docs]

Module Count (fewer modules leads to a faster build time)

id          before  after  diff
streamsApp  1498    1499   +1

Async chunks (total size of all lazy-loaded chunks that will be downloaded as the user navigates the app)

id          before  after  diff
streamsApp  1.6MB   1.6MB  +4.0KB

Public APIs missing exports (total count of every type that is part of your API that should be exported but is not; this causes broken links in the API documentation system. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats exports for more detailed information.)

id       before  after  diff
streams  28      29     +1


Labels

backport:skip (This PR does not require backporting), release_note:skip (Skip the PR/issue when compiling release notes), v9.4.0


7 participants