Adding dataset downloading to kagglehub #131
In a follow-up PR, can you add support in the Kaggle notebook resolver? https://github.com/Kaggle/kagglehub/blob/main/src/kagglehub/kaggle_cache_resolver.py
You will need to make a backend change first to add support for attaching datasets dynamically.
- Add a DatasetReference to the oneof datasource_reference field here: https://github.com/Kaggle/kaggleazure/blob/2308f0d1d087c4b19946024aed8a6306f1bbf4ac/Kaggle.Sdk/kernels/kernels_service.proto#L2998
- Implement the logic to dynamically attach a dataset: https://github.com/Kaggle/kaggleazure/blob/457303918d98b8a0d0931ce2ffbc99896253df5c/Kaggle.Services.Kernels/Handlers/AttachDatasourceUsingJwtHandler.cs#L73
The easiest way to test it in the kagglehub integration in a Kaggle notebook environment is through these steps:
- Push your changes to GitHub on a branch.
- Open a Kaggle notebook on kaggle.com
- Run:
!pip install --force-reinstall --no-deps git+https://github.com/Kaggle/kagglehub.git@NAME_OF_YOUR_BRANCH
import kagglehub
kagglehub.dataset_download("jeward/testDataset")
@@ -6,6 +6,9 @@

from kagglehub.config import get_kaggle_api_endpoint

NUM_VERSIONED_DATASET_PARTS = 4  # e.g.: <owner>/<dataset>/versions/<version>
Do we want to add the versions/ prefix or not?
Pros of keeping it:
- Matches the dataset URL.
Cons of keeping it:
- Handle is different from models.
@jplotts any thoughts?
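To make the handle format under discussion concrete, here is a minimal sketch of how a versioned handle with the versions/ prefix could be parsed. It is not the code from this PR: DatasetHandle, parse_dataset_handle, and NUM_UNVERSIONED_DATASET_PARTS are hypothetical names introduced for illustration, and only NUM_VERSIONED_DATASET_PARTS comes from the diff above.

```python
from dataclasses import dataclass
from typing import Optional

NUM_VERSIONED_DATASET_PARTS = 4  # e.g.: <owner>/<dataset>/versions/<version>
NUM_UNVERSIONED_DATASET_PARTS = 2  # hypothetical constant: <owner>/<dataset>


@dataclass(frozen=True)
class DatasetHandle:
    """Hypothetical container for a parsed handle (not kagglehub's own type)."""

    owner: str
    dataset: str
    version: Optional[int] = None  # None means no version was specified


def parse_dataset_handle(handle: str) -> DatasetHandle:
    """Parse '<owner>/<dataset>' or '<owner>/<dataset>/versions/<version>'."""
    parts = handle.split("/")

    # Versioned form: the third segment must be the literal 'versions' prefix.
    if len(parts) == NUM_VERSIONED_DATASET_PARTS and parts[2] == "versions":
        try:
            version = int(parts[3])
        except ValueError as e:
            raise ValueError(f"Invalid version number in handle: {handle}") from e
        return DatasetHandle(owner=parts[0], dataset=parts[1], version=version)

    # Unversioned form: just owner and dataset slug.
    if len(parts) == NUM_UNVERSIONED_DATASET_PARTS:
        return DatasetHandle(owner=parts[0], dataset=parts[1])

    raise ValueError(f"Invalid dataset handle: {handle}")
```

Dropping the versions/ prefix would turn the versioned form into three segments (<owner>/<dataset>/<version>), which is the part of the trade-off that affects the segment-count check.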
LGTM. Great work. Just a few small suggestions.
This is part 1 of adding dataset support to kagglehub. It adds dataset downloading:
kagglehub.dataset_download("jeward/testDataset")
With an optional file path:
kagglehub.dataset_download("jeward/testDataset", "foo.txt")
b/313706281