Adds dataset cache resolver #134

Merged
merged 5 commits into from
Jul 31, 2024

@jeward414 (Contributor) commented Jun 6, 2024

Add dataset cache resolver for attaching datasets via kagglehub in a Kaggle Notebook.

Demo in Kaggle Notebook

b/356363103

@rosbo rosbo mentioned this pull request Jun 10, 2024
@jeward414 jeward414 force-pushed the datasets-resolver branch from 5f2586b to 8e75538 Compare July 30, 2024 00:34
@jeward414 jeward414 marked this pull request as ready for review July 30, 2024 17:31
@jeward414 jeward414 requested a review from rosbo July 30, 2024 17:31
@rosbo (Contributor) commented Jul 30, 2024

Did you test this in a Kaggle notebook? If so, it would be great to attach a screencast (make sure the dataset is attached) in the PR description.

You will need to run:

# Re-install kagglehub from this branch
!pip install --force-reinstall --no-deps git+https://github.com/Kaggle/kagglehub.git@datasets-resolver

import kagglehub

kagglehub.dataset_download("jessicali9530/animal-crossing-new-horizons-nookplaza-dataset")

And you can use https://screencast.googleplex.com/ to record and share the link.
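
To confirm in the notebook that the dataset files are actually present, the returned path can be inspected. The following is a minimal sketch, assuming `kagglehub.dataset_download` returns the local directory that holds the dataset files; `list_dataset_files` is a hypothetical helper and not part of kagglehub.

```python
from pathlib import Path


def list_dataset_files(dataset_path: str) -> list[str]:
    """Return the files under dataset_path as sorted, POSIX-style relative paths.

    Hypothetical helper for illustration only; not part of kagglehub.
    """
    root = Path(dataset_path)
    return sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()
    )


# Usage in a notebook (assumes kagglehub is installed from the branch above):
#   import kagglehub
#   path = kagglehub.dataset_download(
#       "jessicali9530/animal-crossing-new-horizons-nookplaza-dataset")
#   print(list_dataset_files(path))
```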

@jeward414 (Contributor Author)

Just updated the description 🎉

@rosbo (Contributor) left a comment

LGTM!

I noticed a bug in the backend logic (this should be addressed in a followup PR to our kaggle webtier repo).

If I run:

kagglehub.dataset_download("rosebv/test-versioned/versions/1")

It correctly attaches version 1 of my dataset (it has 3 versions) and attaches files for that version.

However, in the notebook editor UI, the version is not pinned to version 1, and if I recreate my notebook, version 3 is attached. The reason is that, by default, the latest version of a dataset is attached unless you pin a specific version: https://screenshot.googleplex.com/78NtnruigbNFgya
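
The default described above can be sketched as follows. This is only an illustration of the "latest unless pinned" rule, not the backend's actual code: `resolve_attached_version` and its parameters are hypothetical names, and it assumes versions are numbered from 1 up to the latest.

```python
from typing import Optional


def resolve_attached_version(
    latest_version: int, pinned_version: Optional[int] = None
) -> int:
    """Return the dataset version a notebook would attach.

    Hypothetical helper: without a pin, the latest version wins.
    """
    if pinned_version is None:
        return latest_version
    if not 1 <= pinned_version <= latest_version:
        raise ValueError(
            f"pinned version {pinned_version} is outside 1..{latest_version}"
        )
    return pinned_version


# A dataset with 3 versions: a recreated notebook without a pin gets version 3,
# while a notebook pinned to version 1 keeps version 1.
unpinned = resolve_attached_version(latest_version=3)
pinned = resolve_attached_version(latest_version=3, pinned_version=1)
```

Under this rule, attaching version 1 through `dataset_download` without also recording the pin would explain why a recreated notebook ends up on version 3.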

from .server_stubs import serv

INVALID_ARCHIVE_DATASET_HANDLE = "invalid/invalid/invalid/invalid/invalid"
VERSIONED_DATASET_HANDLE = "sarahjeffreson/featured-spotify-artiststracks-with-metadata/versions/1"
@rosbo (Contributor) commented Jul 30, 2024

Orthogonal to this PR, but the fact that dataset handle versioning differs from model handle versioning may be bad UX.

For models, the version is added with a /{VERSION_NUMBER} suffix, not /versions/{VERSION_NUMBER}.

However, I do understand that we chose it to ensure it aligns with the dataset URL format.

Curious to have @goeffthomas / @jplotts thoughts on this.

We should not block merging this PR on this.

Contributor

Yeah, this is a bit rough since Models routing is now very different from Datasets. Would it be a bad idea to support both for Datasets? I know for Models, the UI now has the shortened URLs to match the handles. Being able to support both would mean that if we did want Datasets to follow that pattern, we could do it at our leisure with no impact to kagglehub.
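
One way to picture "supporting both" is a parser that accepts either suffix for a dataset handle. This is a rough sketch only, not kagglehub's actual handle parsing: `DatasetHandle` and `parse_dataset_handle` are hypothetical names, and the sketch assumes versions are positive integers.

```python
from typing import NamedTuple, Optional


class DatasetHandle(NamedTuple):
    owner: str
    dataset: str
    version: Optional[int]  # None means no version was given in the handle


def parse_dataset_handle(handle: str) -> DatasetHandle:
    """Parse a dataset handle, accepting both version suffix styles.

    Hypothetical parser accepting:
      owner/dataset
      owner/dataset/versions/N   (dataset URL style)
      owner/dataset/N            (model-style short suffix)
    """
    parts = handle.split("/")
    if not all(parts):
        raise ValueError(f"invalid dataset handle: {handle!r}")
    if len(parts) == 2:
        return DatasetHandle(parts[0], parts[1], None)
    if len(parts) == 3:
        version_str = parts[2]
    elif len(parts) == 4 and parts[2] == "versions":
        version_str = parts[3]
    else:
        raise ValueError(f"invalid dataset handle: {handle!r}")
    if not (version_str.isascii() and version_str.isdigit()) or int(version_str) < 1:
        raise ValueError(f"invalid dataset version in handle: {handle!r}")
    return DatasetHandle(parts[0], parts[1], int(version_str))


long_form = parse_dataset_handle("rosebv/test-versioned/versions/1")
short_form = parse_dataset_handle("rosebv/test-versioned/1")
```

Both forms resolve to the same owner, dataset, and version, so a later change to dataset URLs would not require a change to callers.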

@jeward414 jeward414 merged commit 6546339 into main Jul 31, 2024
7 checks passed
@jeward414 jeward414 deleted the datasets-resolver branch July 31, 2024 14:32

3 participants