
[Streams 🌊] Enrichment sampling data sources#219736

Merged
tonyghiani merged 56 commits into elastic:main from tonyghiani:218408-control-sample-fetching
Jun 20, 2025
Conversation

@tonyghiani
Contributor

@tonyghiani tonyghiani commented Apr 30, 2025

📓 Summary

Closes #218408

This work started as a simple search bar on the streams enrichment samples, but once we realized that didn't fit the requirements for a smooth simulation experience, we took another direction.

Data sources

Since we want to let users pull documents from multiple sources to simulate their processors (docs from Discover, the failure store, custom documents pasted into the simulator, etc.), this work introduces a data source entity in the simulation playground.
It also converts the random samples that were previously fetched automatically into a dedicated data source.
Now that this is a scalable concept, users can add/remove/enable different data sources for the same simulation:

  • Random samples: Always available by default, so at least one data source is always present; it can still be enabled/disabled on demand.
  • KQL search: Provides a KQL search bar, similar to the one in Discover and across Kibana, which supports pulling documents into this page from Discover or elsewhere.
  • Custom samples: Paste raw documents that will be used alongside the other data sources for the whole simulation.

💡 Reviewer hints

  • The data fetching now relies on the data plugin interfaces, since we needed a more capable API than the _sample one (now removed); this also aligns with the data-fetching practice used for the partitioning page.
  • A data source can behave differently depending on its state (enabled/disabled). To treat it as an isolated concept, each data source is represented by its own actor machine, and the root streamEnrichment machine coordinates event-based communication, as it already does for processor instantiation and management.
  • The data sources are consistently persisted to the URL, with a couple of exceptions:
    • The Custom samples data source is not persisted: it doesn't just describe the data source configuration, it also holds the custom samples defined by the user, which could easily exceed URL limits, so we warn the user that it won't be persisted.
    • The Random samples data source is always available and is restored from the URL, to guarantee that at least one data source is present on the page.
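The data-source concept described above can be sketched as a tagged union. This is a hypothetical shape (field names are illustrative) built around the `KqlSamplesDataSource` interface quoted in the review below, with the URL-persistence exception expressed as a type guard:

```typescript
// Illustrative sketch, not the actual Kibana types.
interface BaseDataSource {
  enabled: boolean; // disabled sources are excluded from the simulation
}

interface RandomSamplesDataSource extends BaseDataSource {
  type: 'random-samples'; // always present; cannot be removed
}

interface KqlSamplesDataSource extends BaseDataSource {
  type: 'kql-samples';
  query: { language: 'kuery'; query: string }; // shape differs from es-query's Query
}

interface CustomSamplesDataSource extends BaseDataSource {
  type: 'custom-samples';
  documents: Record<string, unknown>[]; // user-pasted docs; never persisted
}

type EnrichmentDataSource =
  | RandomSamplesDataSource
  | KqlSamplesDataSource
  | CustomSamplesDataSource;

// Custom samples hold user documents that could exceed URL length limits,
// so only the other data-source types are serialized to the URL.
function isUrlPersistable(dataSource: EnrichmentDataSource): boolean {
  return dataSource.type !== 'custom-samples';
}
```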

@tonyghiani added labels on Apr 30, 2025: Team:obs-onboarding (Observability Onboarding Team), backport:version (Backport to applied version labels), Feature:Streams (This is the label for the Streams Project), v9.1.0, v8.19.0, release_note:skip (Skip the PR/issue when compiling release notes)
@tonyghiani
Contributor Author

@LucaWintergerst thanks for the check on functionalities!

Automatically extend data preview once a user types a new KQL query, so they always see the docs after typing without clicking again

I'd rather have the KQL preview open by default (only for this data source type, as it's the most relevant) than toggle it on typing, since that generates quite a visual shift. I tried it and it looks good!

Can’t click checkbox itself, only around it to enable/disable

I'm aware of this, yes. I opened an issue in EUI; it's already fixed and should land with the next EUI release 👌

Are the data preview fields static? We should probably make this configurable or more flexible in the future.

I made a small change to guarantee it shows all the fields, so body.text will also be shown among the others. I agree we can improve it further, but I'd rather keep that as a follow-up once we gather feedback.

The auto-apply of the kql is very aggressive. I wonder if we should either debounce more or do something more similar to discover where it only applies after pressing enter?

Agreed, the experience is quite bad with errors popping up while the user types. I updated it to apply on Enter or button click and the experience is much better; it also leaves the table in front of users, so they can still pick/copy values from it.

@LucaWintergerst
Contributor

great, I just checked it again and that all works now 👍

a few other minor tweaks, and one slightly larger bug:

1.

The custom samples data source cannot be persisted. It will be lost when you leave this page.

to

The custom samples will not be persisted. They will be lost when you leave the processing page.

2.

For the custom docs, instead of

```json
[{ "foo": "bar", "foo2": "baz" }]
```

use

```json
[{ "@timestamp": "2025-06-17T12:00:00Z", "message": "Sample log message" }]
```

3.

Change the removal confirmation dialog to the following, to make clear to users that we're not actually deleting their data:

Remove sample data source?
Removed sample data source will need to be reconfigured

4.
There is a bug with the failed-% calculation where it sometimes stops showing. It doesn't happen every time. If I had to guess, it might be that we only run the simulation with 100 docs, and if the first sample source has >=100 docs the other sources are no longer used?


@tonyghiani
Contributor Author

tonyghiani commented Jun 18, 2025

@LucaWintergerst thanks for the detailed suggestions, I applied all the changes 👌

Also, the bug in the stats calculation was a subtle one, good catch! I updated the logic to compute the percentage to one decimal place, which gives more accurate stats when the sample count is > 100 docs.
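The rounding fix could look roughly like this (a sketch, not the actual Kibana code):

```typescript
// Hypothetical sketch: compute the failed-document percentage with one decimal
// of precision, so e.g. 1 failed doc out of 300 shows as 0.3% instead of
// rounding down to 0%.
function failedPercentage(failedCount: number, totalCount: number): number {
  if (totalCount === 0) return 0;
  return Math.round((failedCount / totalCount) * 1000) / 10;
}
```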


@Kerry350 Kerry350 self-requested a review June 18, 2025 08:51
Contributor

@Kerry350 Kerry350 left a comment


Great work @tonyghiani 👏

Just a couple of nits; I'm ignoring the checkbox issue, which is being fixed on the EUI side.

This was a really good read — the composition of UI components and state machine refs etc was really nice 👌

```typescript
 */
export interface KqlSamplesDataSource extends BaseDataSource {
  type: 'kql-samples';
  query: {
```
Contributor

For query, filters, and timeRange here, is there a preexisting KQL type we can use rather than redefining them?

Contributor Author

I couldn't find any, unfortunately, so I just reused the exported types to compose it here :( For query, the type is slightly different from the one exported by the es-query package.

```typescript
'xpack.streams.streamDetailView.managementTab.enrichment.dataSources.randomSamples.callout',
{
  defaultMessage:
    'The random samples data source cannot be deleted to guarantee available samples for the simulation. You can still disable it if you want to focus on other data sources samples.',
```
Contributor

Nit: This doesn't read well — can we change this to You can still disable it if you want to focus on samples from other data sources

```typescript
const error = getFormattedError(event.error);
toasts.addError(error, {
  title: i18n.translate('xpack.streams.enrichment.dataSources.dataCollectionError', {
    defaultMessage: 'An issue occurred retrieving data source documents.',
```
Contributor

Nit: Can we change to An issue occurred retrieving documents from the data source.

@elasticmachine
Contributor

💛 Build succeeded, but was flaky

Failed CI Steps

Test Failures

  • [job] [logs] FTR Configs #28 / Rules Management - Rule import export API @ess @serverless @skipInServerlessMKI import_rules importing rules with an index @skipInServerless migrate pre-8.0 action connector ids should be imported into the default space importing a default-space 7.16 rule with a connector made in the default space into the default space should result in a 200

Metrics [docs]

Module Count

Fewer modules lead to a faster build time

| id | before | after | diff |
| --- | --- | --- | --- |
| streamsApp | 441 | 460 | +19 |

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

| id | before | after | diff |
| --- | --- | --- | --- |
| streamsApp | 518.7KB | 541.0KB | +22.3KB |

Page load bundle

Size of the bundles that are downloaded on every page load. Target size is below 100kb

| id | before | after | diff |
| --- | --- | --- | --- |
| streamsApp | 10.5KB | 10.8KB | +271.0B |

Unknown metric groups

async chunk count

| id | before | after | diff |
| --- | --- | --- | --- |
| streamsApp | 7 | 8 | +1 |

ESLint disabled line counts

| id | before | after | diff |
| --- | --- | --- | --- |
| streamsApp | 13 | 12 | -1 |

Total ESLint disabled count

| id | before | after | diff |
| --- | --- | --- | --- |
| streamsApp | 17 | 16 | -1 |

History

@tonyghiani tonyghiani enabled auto-merge (squash) June 20, 2025 05:20
@tonyghiani tonyghiani merged commit b759ebb into elastic:main Jun 20, 2025
10 checks passed
@kibanamachine
Contributor

Starting backport for target branches: 8.19

https://github.com/elastic/kibana/actions/runs/15773250413

@kibanamachine
Contributor

💔 All backports failed

| Branch | Result |
| --- | --- |
| 8.19 | Backport failed because of merge conflicts |

You might need to backport the following PRs to 8.19:
- [Streams 🌊] Make management view the main page for individual stream (#224461)

Manual backport

To create the backport manually run:

```
node scripts/backport --pr 219736
```

Questions ?

Please refer to the Backport tool documentation

@tonyghiani added and removed the backport:version (Backport to applied version labels) label on Jun 23, 2025
@kibanamachine
Contributor

Starting backport for target branches: 8.19

https://github.com/elastic/kibana/actions/runs/15816253337

kibanamachine pushed a commit to kibanamachine/kibana that referenced this pull request Jun 23, 2025
(cherry picked from commit b759ebb)
@kibanamachine
Contributor

💚 All backports created successfully

Branch: 8.19

Note: Successful backport PRs will be merged automatically after passing CI.

Questions ?

Please refer to the Backport tool documentation

@tonyghiani tonyghiani deleted the 218408-control-sample-fetching branch June 23, 2025 07:15
kibanamachine added a commit that referenced this pull request Jun 23, 2025
# Backport

This will backport the following commits from `main` to `8.19`:
- [[Streams 🌊] Enrichment sampling data sources
(#219736)](#219736)

<!--- Backport version: 9.6.6 -->

### Questions ?
Please refer to the [Backport tool
documentation](https://github.com/sorenlouv/backport)


Co-authored-by: Marco Antonio Ghiani <marcoantonio.ghiani01@gmail.com>
akowalska622 pushed a commit to akowalska622/kibana that referenced this pull request Jun 25, 2025


Development

Successfully merging this pull request may close these issues.

[Streams 🌊] Enrichment samples controls

7 participants