
[Streams] Add failure store as a data source for simulations #249559

Merged

CoenWarmer merged 20 commits into elastic:main from CoenWarmer:streams-add-failure-store-as-source on Jan 23, 2026

Conversation

@CoenWarmer (Contributor) commented Jan 19, 2026

Resolves https://github.com/elastic/streams-program/issues/268

Summary

This adds the Failure Store as an option when running simulations.

Screen.Recording.2026-01-19.at.14.43.56.mov

Details

  • Adds an internal endpoint, GET /internal/streams/{name}/processing/_failure_store_samples, which returns samples from a stream's failure store with the processors of all ancestor streams applied.
  • Adds failure-store as a DataSourceActor in XState

@CoenWarmer CoenWarmer requested review from a team as code owners January 19, 2026 13:36
@CoenWarmer CoenWarmer changed the title from "Add failure store as a data source for simulations" to "[Streams] Add failure store as a data source for simulations" Jan 19, 2026
@CoenWarmer CoenWarmer added the release_note:skip (Skip the PR/issue when compiling release notes), backport:skip (This PR does not require backporting), and ci:beta-faster-pr-build labels Jan 19, 2026
@tonyghiani tonyghiani self-requested a review January 19, 2026 15:46
@tonyghiani (Contributor) left a comment

I didn't check all the functionality and the code yet, but something we should take care of for this feature is privileges and feature enablement.

If the failure store is not enabled or the user doesn't have privileges to access it, we shouldn't show the data source and the add action.

@CoenWarmer CoenWarmer requested a review from tonyghiani January 19, 2026 19:44
@CoenWarmer (Contributor, Author)

> If the failure store is not enabled or the user doesn't have privileges to access it, we shouldn't show the data source and the add action.

Good point. Updated.

@tonyghiani (Contributor) left a comment

Overall good job, I left some notes on changes we'd need to make.

Something I noticed playing with it on the UI is that we are probably doing some unnecessary steps for using this data source. The current approach treats the failure store data source like a kql or custom samples one, which are configurable data sources and can be spawned multiple times.

WRT the failure store, it's more similar to the latest samples source.
I think we could simplify the UI and the discoverability of this data source if we simply keep it available in the dropdown and flyout, as we do for the latest samples, having it available only for users with the right privileges and when the failure store is enabled.

@patpascal @LucaWintergerst wdyt?

```typescript
/**
 * Runs the ingest pipeline simulation with the given processors on the documents.
 */
async function simulateWithProcessors({
```

question: there is already the simulateProcessing handler that deals with a lot of edge cases for some processor types, such as the manual ingest pipeline, etc. This is probably a good place to re-use it, as we expect all those edge cases to be handled correctly.

```typescript
}: FailureStoreSamplesDeps): Promise<FailureStoreSamplesResponse> => {
  const { name } = params.path;
  const size = params.query?.size ?? DEFAULT_SAMPLE_SIZE;
  const start = params.query?.start;
```

For failure store samples, we should query the failure store for documents that failed after the last update to the stream's processing, as the user might have already made a change that fixes what was breaking those documents during processing.

For this, you can use the stream.ingest.processing.updated_at property.
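A minimal sketch of the suggested filter (not the PR's actual code; `buildFailureStoreQuery` is a hypothetical helper, and the `@timestamp` field is assumed to mark when the failure was recorded):

```typescript
// Sketch: build an Elasticsearch-style query that only matches failure-store
// documents newer than the stream's last processing update. `updatedAt` would
// come from the stream definition's `stream.ingest.processing.updated_at`.
interface FailureStoreQuery {
  bool: { filter: Array<Record<string, unknown>> };
}

function buildFailureStoreQuery(updatedAt?: string): FailureStoreQuery {
  const filter: Array<Record<string, unknown>> = [];
  if (updatedAt) {
    // Only failures that happened after the last processing change.
    filter.push({ range: { '@timestamp': { gte: updatedAt } } });
  }
  return { bool: { filter } };
}
```

Without an `updatedAt`, the query degrades to matching all failure-store documents, which is the behavior the thread eventually settles on below.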

@flash1293 (Contributor) commented Jan 21, 2026

High level comments, I just checked out the recording in the description:

  • The samples should "unwrap" the error message and related metadata; that isn't what we want to simulate. We just want the original document at the root, not nested in a field called document. Otherwise you can't fix your processing for newly incoming docs based on it.
  • Should we have some additional affordance if there are (recent) docs in the failure store to pull them in? Right now it's really well hidden
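A minimal sketch of the unwrapping, assuming the standard failure-store document layout where the original payload sits under `document.source` next to `error` metadata (`unwrapFailureStoreHit` and the exact shapes here are hypothetical, not the PR's code):

```typescript
// Failure-store documents wrap the original payload; the simulation should run
// on what was actually ingested, so we pull `document.source` back to the root.
interface FailureStoreHit {
  '@timestamp'?: string;
  document?: { index?: string; source?: Record<string, unknown> };
  error?: { type?: string; message?: string };
  [key: string]: unknown;
}

function unwrapFailureStoreHit(hit: FailureStoreHit): Record<string, unknown> {
  // Fall back to the hit itself if it doesn't have the expected shape.
  return hit.document?.source ?? (hit as Record<string, unknown>);
}

const sample = unwrapFailureStoreHit({
  '@timestamp': '2026-01-19T14:43:56.000Z',
  document: { index: 'logs-foo', source: { message: 'boom' } },
  error: { type: 'illegal_argument_exception', message: 'failed to parse' },
});
// sample is the original ingested document: { message: 'boom' }
```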

@CoenWarmer (Contributor, Author)

@tonyghiani @flash1293

Thanks for the reviews! Just so I get this right: we want the failure store option to be added to the drop down that currently shows "Latest samples", instead of going through the steps of first adding it via the control that also allows adding custom KQL samples?

@flash1293 (Contributor)

> Thanks for the reviews! Just so I get this right: we want the failure store option to be added to the drop down that currently shows "Latest samples", instead of going through the steps of first adding it via the control that also allows adding custom KQL samples?

Not sure, that's one option but I'm not sure whether it's the right one. @patpascal what do you think?

@tonyghiani (Contributor)

> Thanks for the reviews! Just so I get this right: we want the failure store option to be added to the drop down that currently shows "Latest samples", instead of going through the steps of first adding it via the control that also allows adding custom KQL samples?

That's how I see it being the simplest possible (just there when available), but let's wait for Patri's feedback.

@patpascal (Contributor)

I think that if users have failure store enabled, showing it as a selectable option, without requiring them to go through the extra steps to add it, would make things easier for them. So I agree.

@CoenWarmer CoenWarmer requested a review from tonyghiani January 21, 2026 17:01
@CoenWarmer (Contributor, Author)

> The samples should "unwrap" the error message part and stuff, that isn't what we want to simulate, we just want the original document at the root, not in a field called document. Otherwise you can't fix your processing for newly incoming docs based on that

Addressed with 5b58dce

> Should we have some additional affordance if there are (recent) docs in the failure store to pull them in? Right now it's really well hidden

Addressed as well

@flash1293 (Contributor) left a comment

There is some kind of state issue:

  • Go to processing tab
  • Select failure store from the list
  • URL state is updated, but nothing changes
  • Refresh
  • It starts working

Works really well otherwise!

```typescript
  dataStream: dataStream as DataStreamWithFailureStore,
});
if (!isEnabledFailureStore(effectiveFailureStore)) {
  throw new SecurityError(`Failure store is not enabled for stream ${params.path.name}`);
```
@flash1293 (Contributor) commented Jan 22, 2026

Nit: That's not really a security error, right?

@CoenWarmer (Contributor, Author)

Fair point. I added a specific FailureStoreNotEnabledError error instead.
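A minimal sketch of what such a dedicated error type could look like (the class name follows the comment above; everything else, including the `assertFailureStoreEnabled` helper, is hypothetical, not the PR's code):

```typescript
// A specific error type makes the failure mode explicit instead of
// misreporting it as a security problem.
class FailureStoreNotEnabledError extends Error {
  constructor(message: string) {
    super(message);
    this.name = 'FailureStoreNotEnabledError';
  }
}

// Hypothetical guard mirroring the original throw site.
function assertFailureStoreEnabled(enabled: boolean, streamName: string): void {
  if (!enabled) {
    throw new FailureStoreNotEnabledError(
      `Failure store is not enabled for stream ${streamName}`
    );
  }
}
```

A route handler can then map this error type to a 4xx response rather than the 403 a security error would imply.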

```typescript
 * Fetches documents from the failure store and applies all configured processors
 * from parent streams to transform them.
 *
 * Only documents that failed after the most recent processing update are returned,
```

Not sold on this, this might be too clever - what if there was an unrelated processing change? This would render the data source useless, even if the user knows what's going on. Processing changes can be messy, what if the user makes a change, then reverts it? They wouldn't be able to get samples anymore.

I think it's OK to return all failure store docs, if processing configurations have been fixed since it's fine anyway because we simulate this updated processing so it won't lead to an error.

@flash1293 (Contributor) commented Jan 22, 2026

We could have it as an optional setting in the details (or maybe just a general KQL query in there?):

Screenshot 2026-01-22 at 09 50 44

But I don't think it's that important actually

@CoenWarmer (Contributor, Author)

@tonyghiani @flash1293

> For failure store samples, we should query the failure store for documents that failed after the last update on the stream processing, as the user might have already made a change that fixes how the failure store documents were breaking during processing.
>
> For this, you can use the stream.ingest.processing.updated_at property.

> Not sold on this, this might be too clever - what if there was an unrelated processing change? This would render the data source useless, even if the user knows what's going on. Processing changes can be messy, what if the user makes a change, then reverts it? They wouldn't be able to get samples anymore.
>
> I think it's OK to return all failure store docs, if processing configurations have been fixed since it's fine anyway because we simulate this updated processing so it won't lead to an error.

These two comments seem to conflict - one suggests filtering by updated_at, the other suggests returning all docs. Should we just return all docs? The simulation will show whether they're still failing with the current processing config, and filtering could hide useful historical failures if unrelated changes update the timestamp.

@flash1293 (Contributor)

They do, sorry for that. Happy to discuss @tonyghiani what do you think after what I wrote in my comment?

@tonyghiani (Contributor)

I see your point Joe, that's fair reasoning about the side effect of my suggestion. Still, in both cases, it would be weird if we kept getting documents that are not actually failing anymore.

There could be another option, but that would probably be a bit overkill:

  • Get the samples from failure store
  • Process them through the whole pipeline, including the current stream.
  • Keep only the documents that failed somewhere in the whole processing, meaning something in the pipeline is still broken just for them.
  • Re-process those samples with the parent processing only (as it works now) and return the final samples.

This would return only samples that don't currently go through the pipeline successfully, and exclude those already fixed by user changes. It might work well, although it complicates the API a bit. @flash1293 @CoenWarmer wdyt?
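The steps above could be sketched roughly like this (all names are hypothetical; `simulateFullPipeline` stands in for running a doc through the entire pipeline, including the current stream):

```typescript
// Rough sketch of the proposal: run each failure-store sample through the full
// pipeline and keep only the ones that still fail somewhere (steps 2 and 3).
type SimulateFn = (doc: Record<string, unknown>) => { failed: boolean };

function keepStillFailingDocs(
  docs: Array<Record<string, unknown>>,
  simulateFullPipeline: SimulateFn
): Array<Record<string, unknown>> {
  return docs.filter((doc) => simulateFullPipeline(doc).failed);
}

// Step 4 (re-processing the survivors with parent processing only, as the
// endpoint works now) would then run on the returned subset.
```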

@flash1293 (Contributor)

Can we have @LucaWintergerst make a quick call on this to not block the task?

@CoenWarmer (Contributor, Author)

> There is some kind of state issue:
>
>   • Go to processing tab
>   • Select failure store from the list
>   • URL state is updated, but nothing changes
>   • Refresh
>   • It starts working
>
> Works really well otherwise!

Fixed with 5e91b6a

@LucaWintergerst (Contributor)

I would suggest we go with all docs, and optionally offer the user an option in the settings where they can set a time picker for this source, like we do today for KQL. I find this more predictable.
If it's already fixed, great, then the user just sees that as part of the regular processing UI.
And if it's mixed, even better, then the user can make sure their changes don't negatively affect other recent failures.

@CoenWarmer (Contributor, Author) commented Jan 22, 2026

> I would suggest we go with all docs, and optionally offer the user an option in the settings where they can set a timepicker for this source, like we'd do it today for KQL. I find this to be more predictable, and if it's already fixed, great, then the user just sees that as part of the regular processing UI, and if it's mixed, even better, then the user can make sure that their changes don't negatively affect other recent failures

@LucaWintergerst @flash1293 @tonyghiani @patpascal

Like this?

Screen.Recording.2026-01-22.at.17.14.31.mov

@LucaWintergerst (Contributor)

Yes I think that's perfect

For all docs, are we getting the latest 100 or is it undefined? I think latest 100 would make a good UX by default

Same for the version with the time picker, is that sorted?

@CoenWarmer (Contributor, Author)

@LucaWintergerst

> For all docs, are we getting the latest 100 or is it undefined?

We're getting the latest 100.

> Same for the version with the time picker, is that sorted?

We're getting 100, sorted newest to oldest.
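A minimal sketch of the sampling behavior described here, assuming an Elasticsearch-style search body (the helper and everything beyond the `@timestamp` sort and the size of 100 are assumptions, not the PR's code):

```typescript
// Latest 100 failure-store docs, newest first; the optional time-picker range
// narrows the window without changing size or sort order.
const DEFAULT_SAMPLE_SIZE = 100;

function buildSamplesSearchBody(start?: string, end?: string) {
  const filter: Array<Record<string, unknown>> = [];
  if (start || end) {
    filter.push({
      range: {
        '@timestamp': {
          ...(start ? { gte: start } : {}),
          ...(end ? { lte: end } : {}),
        },
      },
    });
  }
  return {
    size: DEFAULT_SAMPLE_SIZE,
    sort: [{ '@timestamp': 'desc' }],
    query: { bool: { filter } },
  };
}
```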

@CoenWarmer CoenWarmer requested a review from flash1293 January 23, 2026 12:20
@tonyghiani (Contributor)

Looks pretty neat. The only thing I noticed is that the Failure Store source is deletable, while it shouldn't be, since it's managed by us, like the latest samples source:

Screenshot 2026-01-23 at 15 48 41

@CoenWarmer CoenWarmer enabled auto-merge (squash) January 23, 2026 15:27
@tonyghiani (Contributor) left a comment

Looks great! 👏

@CoenWarmer CoenWarmer merged commit df0d665 into elastic:main Jan 23, 2026
15 of 16 checks passed
@elasticmachine (Contributor)

💚 Build Succeeded

Metrics [docs]

Module Count (fewer modules leads to a faster build time)

id          before  after  diff
streamsApp  1498    1499   +1

Async chunks (total size of all lazy-loaded chunks that will be downloaded as the user navigates the app)

id          before  after  diff
streamsApp  1.6MB   1.6MB  +4.0KB

Public APIs missing exports (total count of every type that is part of your API that should be exported but is not; this causes broken links in the API documentation system. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats exports for more detailed information.)

id       before  after  diff
streams  28      29     +1


Labels

backport:skip (This PR does not require backporting), release_note:skip (Skip the PR/issue when compiling release notes), v9.4.0


7 participants