Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding dataset downloading to kagglehub #131

Merged
merged 11 commits into from
Jun 6, 2024
Merged

Adding dataset downloading to kagglehub #131

merged 11 commits into from
Jun 6, 2024

Conversation

jeward414
Copy link
Contributor

@jeward414 jeward414 commented Jun 4, 2024

This is part 1 of adding dataset support to kagglehub. This adds the functionality of downloading.

kagglehub.dataset_download("jeward/testDataset")
Optional file path -> kagglehub.dataset_download("jeward/testDataset", "foo.txt")

b/313706281

@jeward414 jeward414 marked this pull request as ready for review June 5, 2024 13:17
@jeward414 jeward414 requested review from rosbo and jplotts June 5, 2024 13:17
Copy link
Contributor

@rosbo rosbo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a follow up PR, can you add support in the Kaggle notebook resolver: https://github.com/Kaggle/kagglehub/blob/main/src/kagglehub/kaggle_cache_resolver.py

You will need to make a backend change first to add support for attaching datasets dynamically.

  1. Add a DatasetReference to the oneof datasource_reference field here: https://github.com/Kaggle/kaggleazure/blob/2308f0d1d087c4b19946024aed8a6306f1bbf4ac/Kaggle.Sdk/kernels/kernels_service.proto#L2998
  2. Implement the logic to dynamically attach a dataset: https://github.com/Kaggle/kaggleazure/blob/457303918d98b8a0d0931ce2ffbc99896253df5c/Kaggle.Services.Kernels/Handlers/AttachDatasourceUsingJwtHandler.cs#L73

The easiest way to test it in the kagglehub integration in a Kaggle notebook environment is through these steps:

  1. Push your changes to GitHub on a branch.
  2. Open a Kaggle notebook on kaggle.com
  3. Run:
!pip install --force-reinstall --no-deps git+https://github.com/Kaggle/kagglehub.git@NAME_OF_YOUR_BRANCH

import kagglehub
kagglehub.dataset_download("jeward/testDataset")

@@ -6,6 +6,9 @@

from kagglehub.config import get_kaggle_api_endpoint

NUM_VERSIONED_DATASET_PARTS = 4 # e.g.: <owner>/<dataset>/versions/<version>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to add versions/ prefix or not?

Pros of keeping it:

  • Matches the dataset URL.

Cons of keeping it:

  • Handle is different from models.

@jplotts any thoughts?

@jeward414 jeward414 requested a review from rosbo June 5, 2024 19:26
Copy link
Contributor

@rosbo rosbo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Great work. Just a few small suggestions.

@jeward414 jeward414 merged commit c676f87 into main Jun 6, 2024
7 checks passed
@jeward414 jeward414 deleted the datasets-v2 branch June 6, 2024 18:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants