Load huggingface content datasets#224543
elena-shostak left a comment:
Since `@huggingface/hub` is intended only for local usage, can we move it to dev dependencies? Same for the caching libraries.
src/platform/packages/shared/kbn-avc-banner @elastic/security-defend-workflows
src/platform/packages/shared/kbn-axe-config @elastic/appex-qa
src/platform/packages/shared/kbn-babel-register @elastic/kibana-operations
src/platform/packages/shared/kbn-cache-cli @elastic/kibana-operations
x-pack/platform/packages/shared/index-lifecycle-management/index_lifecycle_management_common_shared @elastic/kibana-management
x-pack/platform/packages/shared/index-management/index_management_shared_types @elastic/kibana-management
x-pack/platform/packages/shared/kbn-ai-assistant @elastic/search-kibana @elastic/obs-ai-assistant
x-pack/platform/packages/shared/kbn-ai-tools-cli @elastic/appex-ai-infra
What does that do, and why should this be excluded? Is it a blanket policy for CLI tools?
I've briefly covered the reason in your other PR - #218694 (review). Ideally, we wouldn't exclude or ignore anything at all, and hopefully we'll get there eventually. However, due to our current constraints (CodeQL is slow, resource-hungry, and not very friendly to incremental tests/changes), we're trying to be pragmatic and only cover non-dev/non-test code. At the same time, I'm not confident this will become a permanent policy (I hope it won't), so we haven't documented it anywhere and are handling it on a case-by-case basis.
@azasypkin thanks Oleg and apologies for missing that. Is this something we can automate? The package has the devOnly prop. If we want to do it manually, does it make sense to exclude these packages as they are very small? I'm ok with adding these manually but doing this every time is probably not sustainable.
In the interest of time I've added them to the CodeQL ignore paths – it'd be great if we can automate this but I also understand it might not be worth it if we are not sure what the long term plan is.
Yeah, we definitely want and need to add more automation here, and we're moving (albeit slowly) in that direction.
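For illustration, the automation discussed in this thread could start from something like the sketch below: given already-parsed package manifests, collect the directories of packages flagged `devOnly` so they can be fed into the CodeQL ignore paths. This is only a sketch under assumptions. The `PackageManifest` shape, the `collectDevOnlyIgnorePaths` name, and the sample `devOnly` values are hypothetical; the thread only confirms that a `devOnly` prop exists, and reading the manifests from disk is left out.

```typescript
// Hypothetical shape of a parsed package manifest. Only the `devOnly` prop is
// mentioned in the thread; `dir` is an assumption made for this sketch.
interface PackageManifest {
  /** Package directory, relative to the repository root. */
  dir: string;
  /** True when the package is only used for development or testing. */
  devOnly?: boolean;
}

/**
 * Returns the directories of all dev-only packages, de-duplicated and sorted,
 * so the result is stable when written into an ignore-paths list.
 */
function collectDevOnlyIgnorePaths(manifests: PackageManifest[]): string[] {
  const dirs = manifests.filter((m) => m.devOnly === true).map((m) => m.dir);
  return Array.from(new Set(dirs)).sort();
}

// Illustrative input: the flag values here are made up for the example.
const ignorePaths = collectDevOnlyIgnorePaths([
  { dir: 'x-pack/platform/packages/shared/kbn-ai-tools-cli', devOnly: true },
  { dir: 'src/platform/packages/shared/kbn-cache-cli', devOnly: true },
  { dir: 'packages/example-runtime-lib', devOnly: false }, // hypothetical package
]);

console.log(ignorePaths.join('\n'));
```

Keeping the function pure (manifests in, paths out) would make it easy to plug into whatever step eventually generates the CodeQL configuration.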
Done!
super(options);
super({
...options,
productCheck: undefined,
Could you please leave a comment explaining why we need to disable it?
…t --include-path /api/status --include-path /api/alerting/rule/ --include-path /api/alerting/rules --include-path /api/actions --include-path /api/security/role --include-path /api/spaces --include-path /api/streams --include-path /api/fleet --include-path /api/dashboards --include-path /api/saved_objects/_import --include-path /api/saved_objects/_export --include-path /api/maintenance_window --update'
@elasticmachine merge upstream
💚 Build Succeeded
Starting backport for target branches: 8.19 https://github.com/elastic/kibana/actions/runs/15905968267
💔 All backports failed
Manual backport: to create the backport manually, run the Backport tool. Questions? Please refer to the Backport tool documentation.
Friendly reminder: Looks like this PR hasn’t been backported yet.
💚 All backports created successfully
Note: Successful backport PRs will be merged automatically after passing CI. Questions? Please refer to the Backport tool documentation.
Implements a huggingface dataset loader for RAG evals - see [x-pack/platform/packages/shared/kbn-ai-tools-cli/src/hf_dataset_loader/README.md](https://github.com/dgieselaar/kibana/blob/hf-dataset-loader/x-pack/platform/packages/shared/kbn-ai-tools-cli/src/hf_dataset_loader/README.md).
Additionally, a `@kbn/cache-cli` tool was added that allows tooling authors to cache to disk (possibly remote storage later).

Used o3 for finding datasets on HuggingFace and doing an initial pass on a line-by-line dataset processor ([see conversation](https://chatgpt.com/share/6853e49a-e870-8000-9c65-f7a5a3a72af0))

Libraries added:

- `cache-manager`, `cache-manager-fs-hash`, `keyv`, `@types/cache-manager-fs-hash`: caching libraries and plugins. could not find any existing caching libraries in the repo.
- `@huggingface/hub`: api client for HF.

---------

Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>

(cherry picked from commit 7d20301)

# Conflicts:
#	.github/CODEOWNERS
#	tsconfig.base.json
#	yarn.lock
Looks like this PR has a backport PR but it still hasn't been merged. Please merge it ASAP to keep the branches relatively in sync.
# Backport

This will backport the following commits from `main` to `8.19`:
- [Load huggingface content datasets (#224543)](#224543)

### Questions ?
Please refer to the [Backport tool documentation](https://github.com/sorenlouv/backport)
Implements a huggingface dataset loader for RAG evals - see x-pack/platform/packages/shared/kbn-ai-tools-cli/src/hf_dataset_loader/README.md. Additionally, a `@kbn/cache-cli` tool was added that allows tooling authors to cache to disk (possibly remote storage later).

Used o3 for finding datasets on HuggingFace and doing an initial pass on a line-by-line dataset processor (see conversation).

Libraries added:
- `cache-manager`, `cache-manager-fs-hash`, `keyv`, `@types/cache-manager-fs-hash`: caching libraries and plugins. Could not find any existing caching libraries in the repo.
- `@huggingface/hub`: API client for HF.
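To make the "cache to disk" idea concrete, here is a minimal sketch of a file-backed cache that uses only Node's standard library. It is not the `@kbn/cache-cli` API and does not use `cache-manager` or `keyv`; the `DiskCache` class, the `loadRows` helper, and the dataset name are all hypothetical and exist only to show the pattern of hashing a key to a file and skipping the expensive step on a hit.

```typescript
import { createHash } from 'node:crypto';
import { existsSync, mkdirSync, mkdtempSync, readFileSync, writeFileSync } from 'node:fs';
import { tmpdir } from 'node:os';
import { join } from 'node:path';

// Hypothetical file-backed cache: one JSON file per key, named by the key's hash.
class DiskCache {
  private readonly dir: string;

  constructor(dir: string) {
    this.dir = dir;
    mkdirSync(dir, { recursive: true });
  }

  // Hash the key so arbitrary strings map to safe file names.
  private fileFor(key: string): string {
    const hash = createHash('sha256').update(key).digest('hex');
    return join(this.dir, `${hash}.json`);
  }

  get<T>(key: string): T | undefined {
    const file = this.fileFor(key);
    if (!existsSync(file)) return undefined;
    return JSON.parse(readFileSync(file, 'utf8')) as T;
  }

  set<T>(key: string, value: T): void {
    writeFileSync(this.fileFor(key), JSON.stringify(value), 'utf8');
  }
}

// Usage: a temporary directory stands in for a real cache location.
const cache = new DiskCache(mkdtempSync(join(tmpdir(), 'disk-cache-sketch-')));
let computeCalls = 0;

function loadRows(name: string): string[] {
  const hit = cache.get<string[]>(name);
  if (hit !== undefined) return hit;
  computeCalls += 1;
  // Stand-in for an expensive step such as downloading a dataset.
  const rows = [`${name}:row-1`, `${name}:row-2`];
  cache.set(name, rows);
  return rows;
}

const first = loadRows('example-dataset');
const second = loadRows('example-dataset'); // served from disk, no recompute
console.log(computeCalls, second);
```

The synchronous file calls keep the sketch short; a real tool that fetches remote datasets would more likely use asynchronous I/O and some form of expiry.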