Adding dataset downloading to kagglehub #131

jeward414 · 2024-06-04T20:00:01Z

This is part 1 of adding dataset support to kagglehub. It adds the ability to download datasets.

Whole dataset: kagglehub.dataset_download("jeward/testDataset")
With an optional file path: kagglehub.dataset_download("jeward/testDataset", "foo.txt")

b/313706281

rosbo

In a follow-up PR, can you add dataset support to the Kaggle notebook resolver? https://github.com/Kaggle/kagglehub/blob/main/src/kagglehub/kaggle_cache_resolver.py

You will need to make a backend change first to add support for attaching datasets dynamically.

Add a DatasetReference to the oneof datasource_reference field here: https://github.com/Kaggle/kaggleazure/blob/2308f0d1d087c4b19946024aed8a6306f1bbf4ac/Kaggle.Sdk/kernels/kernels_service.proto#L2998
Implement the logic to dynamically attach a dataset: https://github.com/Kaggle/kaggleazure/blob/457303918d98b8a0d0931ce2ffbc99896253df5c/Kaggle.Services.Kernels/Handlers/AttachDatasourceUsingJwtHandler.cs#L73

The easiest way to test the kagglehub integration in a Kaggle notebook environment is as follows:

Push your changes to GitHub on a branch.
Open a Kaggle notebook on kaggle.com
Run:

!pip install --force-reinstall --no-deps git+https://github.com/Kaggle/kagglehub.git@NAME_OF_YOUR_BRANCH

import kagglehub
kagglehub.dataset_download("jeward/testDataset")

rosbo · 2024-06-05T17:26:36Z

src/kagglehub/handle.py

@@ -6,6 +6,9 @@

 from kagglehub.config import get_kaggle_api_endpoint

+NUM_VERSIONED_DATASET_PARTS = 4  # e.g.: <owner>/<dataset>/versions/<version>


Do we want to include the versions/ segment in the dataset handle or not?

Pros of keeping it:

Matches the dataset URL.

Cons of keeping it:

The handle format differs from the one used for models.

@jplotts any thoughts?
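To make the trade-off concrete, here is a minimal sketch of how a handle in the <owner>/<dataset>/versions/<version> form could be parsed. It is not the code in this PR: DatasetHandleSketch, parse_dataset_handle_sketch, and NUM_UNVERSIONED_DATASET_PARTS are hypothetical names used only for illustration, and only NUM_VERSIONED_DATASET_PARTS comes from the diff above.

```python
from dataclasses import dataclass
from typing import Optional

NUM_VERSIONED_DATASET_PARTS = 4  # e.g.: <owner>/<dataset>/versions/<version>
NUM_UNVERSIONED_DATASET_PARTS = 2  # hypothetical constant: <owner>/<dataset>


@dataclass(frozen=True)
class DatasetHandleSketch:
    """Hypothetical container for a parsed dataset handle."""

    owner: str
    dataset: str
    version: Optional[int] = None  # None means no version was given


def parse_dataset_handle_sketch(handle: str) -> DatasetHandleSketch:
    """Parse '<owner>/<dataset>' or '<owner>/<dataset>/versions/<version>'."""
    parts = handle.split("/")
    if not all(parts):
        # Rejects empty segments such as 'owner//dataset' or a trailing slash.
        raise ValueError(f"Invalid dataset handle: {handle}")

    if len(parts) == NUM_VERSIONED_DATASET_PARTS:
        owner, dataset, keyword, version_str = parts
        if keyword != "versions":
            raise ValueError(f"Expected 'versions' segment in handle: {handle}")
        try:
            version = int(version_str)
        except ValueError as exc:
            raise ValueError(f"Version must be an integer: {handle}") from exc
        return DatasetHandleSketch(owner, dataset, version)

    if len(parts) == NUM_UNVERSIONED_DATASET_PARTS:
        owner, dataset = parts
        return DatasetHandleSketch(owner, dataset)

    raise ValueError(f"Invalid dataset handle: {handle}")
```

With this shape, dropping the versions/ segment would mean changing the versioned part count to 3 and removing the keyword check, which is roughly the cost of either choice on the parsing side.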

rosbo

LGTM. Great work. Just a few small suggestions.

integration_tests/utils.py

src/kagglehub/http_resolver.py

jeward414 added 5 commits June 4, 2024 19:58

Adding dataset downloading to kagglehub

1424ab4

Add integration test

54af5e0

lint

2985e1c

Fix integration test

9338a28

Fix lint

3ac1c74

jeward414 marked this pull request as ready for review June 5, 2024 13:17

jeward414 requested review from rosbo and jplotts June 5, 2024 13:17

Add more tests

8224de2

rosbo requested changes Jun 5, 2024

jeward414 added 2 commits June 5, 2024 19:16

PR comments

ca72f98

Lint

e5a0d71

jeward414 requested a review from rosbo June 5, 2024 19:26

jeward414 and others added 2 commits June 5, 2024 21:18

Merge branch 'main' into datasets-v2

5585fc7

Fix test

e87c2f8

rosbo approved these changes Jun 5, 2024

PR suggestions

cebd36e

jeward414 merged commit c676f87 into main Jun 6, 2024
7 checks passed

jeward414 deleted the datasets-v2 branch June 6, 2024 18:22
