Load huggingface content datasets #224543

Merged
dgieselaar merged 13 commits into elastic:main from
dgieselaar:hf-dataset-loader
Jun 26, 2025
@dgieselaar
Contributor

@dgieselaar dgieselaar commented Jun 19, 2025

Implements a huggingface dataset loader for RAG evals - see x-pack/platform/packages/shared/kbn-ai-tools-cli/src/hf_dataset_loader/README.md. Additionally, a @kbn/cache-cli tool was added that allows tooling authors to cache to disk (possibly remote storage later).

Used o3 for finding datasets on HuggingFace and doing an initial pass on a line-by-line dataset processor (see conversation)
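
For readers who want a concrete picture of what a line-by-line processor does, here is a minimal sketch. It is not the code from this PR: it assumes the dataset file is JSON Lines (one JSON document per line) and that the text arrives in arbitrary chunks, and `JsonLineSplitter` and `parseLines` are hypothetical names.

```typescript
// Hypothetical helper, not taken from this PR. Assumes the dataset is JSON Lines
// (one JSON document per line) and arrives as arbitrary text chunks.
class JsonLineSplitter {
  private buffer = '';

  // Accept the next chunk and return every document whose line is now complete.
  push(chunk: string): unknown[] {
    this.buffer += chunk;
    const lines = this.buffer.split('\n');
    // The last element is an incomplete line (or ''), so keep it for later.
    this.buffer = lines.pop() ?? '';
    return parseLines(lines);
  }

  // Call once at end of input to parse a final line without a trailing newline.
  flush(): unknown[] {
    const rest = this.buffer;
    this.buffer = '';
    return parseLines([rest]);
  }
}

// Parse each non-empty line as one JSON document; blank lines are skipped.
function parseLines(lines: string[]): unknown[] {
  const documents: unknown[] = [];
  for (const line of lines) {
    const trimmed = line.trim();
    if (trimmed.length > 0) {
      documents.push(JSON.parse(trimmed));
    }
  }
  return documents;
}

// Usage: a document split across two chunks is only emitted once it is complete.
const splitter = new JsonLineSplitter();
const firstBatch = splitter.push('{"id":1}\n{"id"'); // [{ id: 1 }]
const secondBatch = splitter.push(':2}\n\n{"id":3}'); // [{ id: 2 }]
const lastBatch = splitter.flush(); // [{ id: 3 }]
```

Buffering only the unfinished tail keeps memory use proportional to the longest line, not to the file size, which is the usual reason to process large datasets this way.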

Libraries added:

  • cache-manager, cache-manager-fs-hash, keyv, @types/cache-manager-fs-hash: caching libraries and plugins. I could not find any existing caching libraries in the repo.
  • @huggingface/hub: API client for HF.
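
To illustrate what "cache to disk" means for a tooling author, here is a minimal sketch built only on Node's standard library. It is not the `@kbn/cache-cli` API and does not use the libraries listed above: `createDiskCache`, `wrap`, and the key `dataset:example` are hypothetical, and the sketch is synchronous for brevity.

```typescript
import { createHash } from 'node:crypto';
import * as fs from 'node:fs';
import * as os from 'node:os';
import * as path from 'node:path';

// Hypothetical sketch, NOT the @kbn/cache-cli API: a tiny JSON-on-disk cache.
// Values must be JSON-serializable.
function createDiskCache(directory: string) {
  fs.mkdirSync(directory, { recursive: true });

  // Hash the key so that arbitrary strings (URLs, paths) become safe file names.
  const fileFor = (key: string): string =>
    path.join(directory, createHash('sha256').update(key).digest('hex') + '.json');

  return {
    // Return the cached value, or compute it once and persist it.
    wrap<T>(key: string, compute: () => T): T {
      const file = fileFor(key);
      if (fs.existsSync(file)) {
        return JSON.parse(fs.readFileSync(file, 'utf8')) as T;
      }
      const value = compute();
      fs.writeFileSync(file, JSON.stringify(value));
      return value;
    },
  };
}

// Usage: the second call is served from disk, so the compute callback runs once.
const cache = createDiskCache(fs.mkdtempSync(path.join(os.tmpdir(), 'cache-sketch-')));
let computeCalls = 0;
const load = () => cache.wrap('dataset:example', () => ({ rows: ++computeCalls }));
const first = load();
const second = load();
```

Because the value is read back from a file, the cache survives across separate CLI runs, which an in-memory cache would not.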

@dgieselaar dgieselaar requested a review from a team as a code owner June 19, 2025 10:22
@dgieselaar dgieselaar added the release_note:skip, v9.1.0, and v8.19.0 labels Jun 19, 2025
@kibanamachine
Contributor

kibanamachine commented Jun 19, 2025

Dependency Review Bot Analysis 🔍

Found 5 new third-party dependencies:

| Package | Version | Vulnerabilities | Health Score |
| --- | --- | --- | --- |
| cache-manager | ^7.0.0 | 🔴 C: 0, 🟠 H: 0, 🟡 M: 0, 🟢 L: 0 | cache-manager |
| keyv | ^5.3.4 | 🔴 C: 0, 🟠 H: 0, 🟡 M: 0, 🟢 L: 0 | keyv |
| @huggingface/hub | ^2.2.0 | 🔴 C: 0, 🟠 H: 0, 🟡 M: 0, 🟢 L: 0 | @huggingface/hub |
| @types/cache-manager-fs-hash | ^0.0.5 | 🔴 C: 0, 🟠 H: 0, 🟡 M: 0, 🟢 L: 0 | @types/cache-manager-fs-hash |
| cache-manager-fs-hash | ^2.0.0 | 🔴 C: 0, 🟠 H: 0, 🟡 M: 0, 🟢 L: 0 | cache-manager-fs-hash |

Self Checklist

To help with the review, please update the PR description to address the following points for each new third-party dependency listed above:

  • Purpose: What is this dependency used for? Briefly explain its role in your changes.
  • Justification: Why is adding this dependency the best approach?
  • Alternatives explored: Were other options considered (e.g., using existing internal libraries/utilities, implementing the functionality directly)? If so, why was this dependency chosen over them?
  • Existing dependencies: Does Kibana have a dependency providing similar functionality? If so, why is the new one preferred?

Thank you for providing this information!

@kibanamachine kibanamachine requested a review from a team June 19, 2025 10:22
@dgieselaar dgieselaar added the backport:version label Jun 19, 2025
Contributor

@elena-shostak elena-shostak left a comment

Since @huggingface/hub is intended only for local usage, can we move it to dev dependencies? Same for the caching libraries.

src/platform/packages/shared/kbn-avc-banner @elastic/security-defend-workflows
src/platform/packages/shared/kbn-axe-config @elastic/appex-qa
src/platform/packages/shared/kbn-babel-register @elastic/kibana-operations
src/platform/packages/shared/kbn-cache-cli @elastic/kibana-operations
Contributor

Could we please add it to CodeQL ignore paths?

x-pack/platform/packages/shared/index-lifecycle-management/index_lifecycle_management_common_shared @elastic/kibana-management
x-pack/platform/packages/shared/index-management/index_management_shared_types @elastic/kibana-management
x-pack/platform/packages/shared/kbn-ai-assistant @elastic/search-kibana @elastic/obs-ai-assistant
x-pack/platform/packages/shared/kbn-ai-tools-cli @elastic/appex-ai-infra
Contributor

Could we please add it to CodeQL ignore paths?

Contributor Author

What does that do, and why should this be excluded? Is it a blanket policy for CLI tools?

Contributor

I've briefly covered the reason in your other PR - #218694 (review). Ideally, we wouldn't exclude or ignore anything at all, and hopefully, we'll get there eventually. However, due to our current constraints (CodeQL is slow, resource-hungry, and not super friendly to incremental tests/changes), we're trying to be pragmatic and only cover non-dev/non-test code. At the same time, I'm not yet confident it'll become a permanent policy (at least I hope so), so we haven't documented this anywhere and are handling it on a case-by-case basis.

Contributor Author

@azasypkin thanks Oleg and apologies for missing that. Is this something we can automate? The package has the devOnly prop. If we want to do it manually, does it make sense to exclude these packages as they are very small? I'm ok with adding these manually but doing this every time is probably not sustainable.

Contributor Author

In the interest of time I've added them to the CodeQL ignore paths – it'd be great if we can automate this but I also understand it might not be worth it if we are not sure what the long term plan is.

Contributor

Yeah, we definitely want and need to add more automation here, and we're moving (albeit slowly) in that direction.
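
As a rough illustration of the automation discussed in this thread, the sketch below derives ignore paths from package manifests that carry the `devOnly` flag. Only `devOnly` comes from the discussion: the `PackageManifest` shape, the `directory` field, the `codeqlIgnorePaths` function, and the sample input (including which packages are flagged) are assumptions, and how the result would be written into the CodeQL configuration is left out.

```typescript
// Hypothetical shape: only `devOnly` is mentioned in the discussion; the other
// field name is an assumption made for this sketch.
interface PackageManifest {
  directory: string; // repo-relative package directory
  devOnly?: boolean;
}

// Collect the directories of dev-only packages, sorted for stable output.
function codeqlIgnorePaths(manifests: PackageManifest[]): string[] {
  return manifests
    .filter((manifest) => manifest.devOnly === true)
    .map((manifest) => manifest.directory)
    .sort();
}

// Illustrative input; the devOnly values here are assumed, not read from the repo.
const ignorePaths = codeqlIgnorePaths([
  { directory: 'x-pack/platform/packages/shared/kbn-ai-tools-cli', devOnly: true },
  { directory: 'src/platform/packages/shared/kbn-cache-cli', devOnly: true },
  { directory: 'x-pack/platform/packages/shared/kbn-ai-assistant' },
]);
// ignorePaths now lists the two dev-only package directories.
```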

@dgieselaar
Contributor Author

> Since @huggingface/hub is intended only for local usage can we move it to dev dependencies? Same for caching libraries

Done!

@kibanamachine kibanamachine requested a review from a team June 19, 2025 14:07
-    super(options);
+    super({
+      ...options,
+      productCheck: undefined,
Contributor

Could you please leave a comment explaining why we need to disable it?

dgieselaar and others added 4 commits June 20, 2025 13:32
…t --include-path /api/status --include-path /api/alerting/rule/ --include-path /api/alerting/rules --include-path /api/actions --include-path /api/security/role --include-path /api/spaces --include-path /api/streams --include-path /api/fleet --include-path /api/dashboards --include-path /api/saved_objects/_import --include-path /api/saved_objects/_export --include-path /api/maintenance_window --update'
Contributor

@elena-shostak elena-shostak left a comment

LGTM

@dgieselaar
Contributor Author

@elasticmachine merge upstream

@elasticmachine
Contributor

💚 Build Succeeded

Metrics [docs]

Public APIs missing comments

Total count of every public API that lacks a comment. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats comments for more detailed information.

| id | before | after | diff |
| --- | --- | --- | --- |
| @kbn/ai-tools-cli | - | 11 | +11 |
| @kbn/cache-cli | - | 9 | +9 |
| total | | | +20 |

Any counts in public APIs

Total count of every `any`-typed public API. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats any for more detailed information.

| id | before | after | diff |
| --- | --- | --- | --- |
| @kbn/ai-tools-cli | - | 1 | +1 |

Public APIs missing exports

Total count of every type that is part of your API that should be exported but is not. This will cause broken links in the API documentation system. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats exports for more detailed information.

| id | before | after | diff |
| --- | --- | --- | --- |
| @kbn/cache-cli | - | 2 | +2 |

Unknown metric groups

API count

| id | before | after | diff |
| --- | --- | --- | --- |
| @kbn/ai-tools-cli | - | 18 | +18 |
| @kbn/cache-cli | - | 9 | +9 |
| total | | | +27 |

History

@dgieselaar dgieselaar merged commit 7d20301 into elastic:main Jun 26, 2025
13 checks passed
@dgieselaar dgieselaar deleted the hf-dataset-loader branch June 26, 2025 15:24
@kibanamachine
Contributor

Starting backport for target branches: 8.19

https://github.com/elastic/kibana/actions/runs/15905968267
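
The target branch 8.19 follows from the v8.19.0 label on this PR. As a rough sketch of that resolution, the two patterns below are the label-to-branch mapping recorded in this PR's backport metadata (`^v9.1.0$` to `main`, `^v(\d+).(\d+).\d+$` to `$1.$2`). The `branchForLabel` function is hypothetical and is not the backport tool's implementation.

```typescript
// Mapping copied from this PR's backport metadata; the resolver itself is a
// hypothetical sketch, not the backport tool's implementation.
const branchLabelMapping: Array<[RegExp, string]> = [
  [/^v9.1.0$/, 'main'],
  [/^v(\d+).(\d+).\d+$/, '$1.$2'],
];

// Return the branch for the first pattern that matches, or undefined for
// labels that are not version labels.
function branchForLabel(label: string): string | undefined {
  for (const [pattern, replacement] of branchLabelMapping) {
    if (pattern.test(label)) {
      return label.replace(pattern, replacement);
    }
  }
  return undefined;
}

const mainBranch = branchForLabel('v9.1.0'); // 'main'
const backportBranch = branchForLabel('v8.19.0'); // '8.19'
const unrelated = branchForLabel('release_note:skip'); // undefined
```

The order matters: the specific `v9.1.0` entry is checked first, otherwise the generic pattern would map it to a `9.1` branch.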

@kibanamachine
Contributor

💔 All backports failed

Status Branch Result
8.19 Backport failed because of merge conflicts

Manual backport

To create the backport manually run:

node scripts/backport --pr 224543

Questions ?

Please refer to the Backport tool documentation

@kibanamachine kibanamachine added the backport missing label Jun 30, 2025
@kibanamachine
Contributor

Friendly reminder: Looks like this PR hasn’t been backported yet.
To create backports automatically, add a backport:* label, or prevent reminders by adding the backport:skip label.
You can also create backports manually by running node scripts/backport --pr 224543 locally.
cc: @dgieselaar

@dgieselaar
Contributor Author

💚 All backports created successfully

Status Branch Result
8.19

Note: Successful backport PRs will be merged automatically after passing CI.

Questions ?

Please refer to the Backport tool documentation

dgieselaar added a commit to dgieselaar/kibana that referenced this pull request Jul 2, 2025
Implements a huggingface dataset loader for RAG evals - see
[x-pack/platform/packages/shared/kbn-ai-tools-cli/src/hf_dataset_loader/README.md](https://github.com/dgieselaar/kibana/blob/hf-dataset-loader/x-pack/platform/packages/shared/kbn-ai-tools-cli/src/hf_dataset_loader/README.md).
Additionally, a `@kbn/cache-cli` tool was added that allows tooling
authors to cache to disk (possibly remote storage later).

Used o3 for finding datasets on HuggingFace and doing an initial pass on
a line-by-line dataset processor ([see
conversation](https://chatgpt.com/share/6853e49a-e870-8000-9c65-f7a5a3a72af0))

Libraries added:

- `cache-manager`, `cache-manager-fs-hash`, `keyv`,
`@types/cache-manager-fs-hash`: caching libraries and plugins. could not
find any existing caching libraries in the repo.
- `@huggingface/hub`: api client for HF.

---------

Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
(cherry picked from commit 7d20301)

# Conflicts:
#	.github/CODEOWNERS
#	tsconfig.base.json
#	yarn.lock
@kibanamachine
Contributor

Looks like this PR has a backport PR but it still hasn't been merged. Please merge it ASAP to keep the branches relatively in sync.
cc: @dgieselaar

dgieselaar added a commit that referenced this pull request Jul 3, 2025
# Backport

This will backport the following commits from `main` to `8.19`:
- [Load huggingface content datasets
(#224543)](#224543)

<!--- Backport version: 10.0.1 -->

### Questions ?
Please refer to the [Backport tool
documentation](https://github.com/sorenlouv/backport)

---------

Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
@kibanamachine kibanamachine removed the backport missing label Jul 3, 2025

Labels

backport:version, release_note:skip, v8.19.0, v9.1.0

6 participants