-
Notifications
You must be signed in to change notification settings - Fork 36
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adds dataset cache resolver #134
Conversation
5f2586b
to
8e75538
Compare
Did you test this in a Kaggle notebook? If so, it would be great to attach a screencast (make sure the dataset is attached) in the PR description. You will need to run:
And you can use https://screencast.googleplex.com/ to record and share the link. |
Just updated the description 🎉 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
I noticed a bug in the backend logic (this should be addressed in a followup PR to our kaggle webtier repo).
If I run:
kagglehub.dataset_download("rosebv/test-versioned/versions/1")
It correctly attaches version 1 of my dataset (it has 3 versions) and attaches files for that version.
However, in the notebook editor UI, the version is not pinned to version 3. And if I recreate my notebook, then, version 3 is attached. The reason is that by default for datasets, the latest version is attached unless you pin a specific version: https://screenshot.googleplex.com/78NtnruigbNFgya
from .server_stubs import serv | ||
|
||
INVALID_ARCHIVE_DATASET_HANDLE = "invalid/invalid/invalid/invalid/invalid" | ||
VERSIONED_DATASET_HANDLE = "sarahjeffreson/featured-spotify-artiststracks-with-metadata/versions/1" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Orthogonal to this PR but the fact that the handle versioning is different from a model may be bad UX.
For models, the version is added with a /{VERSION_NUMBER}
suffix, not /versions/{VERSION_NUMBER}
.
However, I do understand that we chose it to ensure it aligns with the dataset URL format.
Curious to have @goeffthomas / @jplotts thoughts on this.
We should not block merging this PR on this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this is a bit rough since Models routing is now very different from Datasets. Would it be a bad idea to support both for Datasets? I know for Models, the UI now has the shortened URLs to match the handles. Being able to support both would mean that if we did want Datasets to follow that pattern, we could do it at our leisure with no impact to kagglehub
.
Add dataset cache resolver for attaching datasets via kagglehub in a Kaggle Notebook.
Demo in Kaggle Notebook
b/356363103