Add Competition Download #158

calderjo · 2024-09-16T22:41:26Z

This is part 1 of 2 for downloading competition datasets via kagglehub.
Higher-level view:

competition_download takes in

competition slug (which is unique)
path (to perform single file downloads) (similar to dataset download)
force download - (similar to dataset download)

The Kaggle API (and this method as well) only lets users download the latest competition dataset, which is intentional.

A follow-up PR will implement the resolver to attach a competition dataset inside a Kaggle notebook.

https://b.corp.google.com/issues/369206113
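For illustration, a minimal sketch of how the three inputs listed above might be passed. It assumes the function is exposed as `kagglehub.competition_download` with `path` and `force_download` keyword arguments; the `titanic` slug and `train.csv` file name are example values only, and the helper name `download_titanic_examples` is hypothetical.

```python
def download_titanic_examples() -> dict:
    """Show the three call shapes described above; returns the local paths.

    Calling this needs network access, Kaggle credentials, and accepted
    competition rules, so it is defined here but not invoked.
    """
    # Imported lazily so the sketch can be loaded without kagglehub installed.
    import kagglehub

    return {
        # Whole competition dataset (always the latest one).
        "all_files": kagglehub.competition_download("titanic"),
        # Single file download via `path`.
        "single_file": kagglehub.competition_download("titanic", path="train.csv"),
        # Skip the local cache and download again.
        "forced": kagglehub.competition_download("titanic", force_download=True),
    }
```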

calderjo · 2024-09-23T21:17:36Z

there's an issue with the integration tester uploading datasets with an already-taken slug; i don't think it's related to the changes i made:

{'error': 'The requested title "dataset-513096b0-574f-4ede-8e64-cd4b18687dec" is already in use by a dataset. Please c...e64-cd4b18687dec" is already in use by a dataset. Please choose another title.'

rosbo

Great job. A few suggestions / questions.

a follow up PR, will add logic to mount competition to inputs for interactive kaggle

I assume you are referring to the resolver to attach a competition dataset inside a Kaggle notebook?

rosbo · 2024-09-23T23:03:13Z

src/kagglehub/cache.py

+        return os.path.join(
+            get_cache_folder(),
+            COMPETITIONS_CACHE_SUBFOLDER,
+            COMPETITIONS_INDIVIDUAL_FILE_MARKER_FOLDER,


Why use a different marker than for datasets & models?

yeah it gets a bit weird when testing.

since competition file paths are shallow compared to datasets and models, i.e. "competition/comp-name/file" vs "datasets/jcchavez/dataset-name/version/1/file", there isn't a good place to put markers where they won't conflict with the "ls" command, which causes tests to fail.

conflict in this case meaning "ls" will read in the marker along with the downloaded files, i.e.
[test.csv.complete, test.csv], where we would instead want just [test.csv].

I could make the tests exclude the markers, but i think users don't want to adjust their commands to avoid markers. so i opted to move markers into their own folder, and read from that folder for caching.

example where we check file contents

https://github.com/Kaggle/kagglehub/pull/158/files#diff-36647608fc859a5b49f8e49a09525e74fc13f5b36fef10894cb3dd9e7966a1acR54
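A minimal sketch of the separate marker folder idea described above, not the PR's actual code. The constant values and helper names (`file_path`, `marker_path`, `mark_complete`, `is_cached`) are assumptions made for illustration; only the `.complete` suffix comes from the example in this thread.

```python
import os
import tempfile

# Hypothetical values; the PR's real constants may be named or set differently.
COMPETITIONS_SUBFOLDER = "competitions"
MARKER_FOLDER = ".complete"


def file_path(cache_root: str, competition: str, filename: str) -> str:
    """Where the downloaded file itself lives."""
    return os.path.join(cache_root, COMPETITIONS_SUBFOLDER, competition, filename)


def marker_path(cache_root: str, competition: str, filename: str) -> str:
    """Where the completion marker lives: a sibling tree, not next to the file."""
    return os.path.join(
        cache_root, COMPETITIONS_SUBFOLDER, MARKER_FOLDER, competition, f"{filename}.complete"
    )


def mark_complete(cache_root: str, competition: str, filename: str) -> None:
    """Create an empty marker once a download has finished."""
    path = marker_path(cache_root, competition, filename)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w"):
        pass


def is_cached(cache_root: str, competition: str, filename: str) -> bool:
    """A file counts as cached only if both the file and its marker exist."""
    return os.path.exists(file_path(cache_root, competition, filename)) and os.path.exists(
        marker_path(cache_root, competition, filename)
    )


def demo() -> list:
    """Write one file plus its marker, then list the competition folder."""
    root = tempfile.mkdtemp()
    target = file_path(root, "titanic", "test.csv")
    os.makedirs(os.path.dirname(target), exist_ok=True)
    with open(target, "w") as f:
        f.write("id,value\n")
    mark_complete(root, "titanic", "test.csv")
    return os.listdir(os.path.dirname(target))
```

With this layout, listing the competition folder returns only the downloaded files, because the markers sit in their own subtree.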

Why wouldn't this work?

competitions/titanic/train.csv
competitions/.complete/titanic/titanic.csv.complete

Also, you will need to track version somehow in the cache because if a new competition dataset is released by the host, you want to download it. If you don't track the version, it would simply return the "stale" files from your local cache.

did some looking, seems like kaggle api does this by checking the "last modified" timestamp and comparing.
can we go by this approach?

https://github.com/Kaggle/kaggle-api/blob/ded7a528499d5db351f94c867a2441c1101f4ad0/kaggle/api/kaggle_api_extended.py#L3778-L3810

if you're referring to using "ids" like we use for competition datasets, i don't think we have a method that directly exposes those details. closest i found was the signed URL (which has the id) that gets returned in the response; we could parse that and cache it for later comparisons.

i might be overthinking this, maybe there's a way simpler way...

Makes sense to base this on "last modified".
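A rough sketch of a "last modified" staleness check along the lines discussed above. It assumes the server sends a standard HTTP `Last-Modified` header and that the local file's modification time is a fair stand-in for when it was downloaded; the function name `download_needed` and its signature are hypothetical and not taken from the PR.

```python
import os
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def download_needed(last_modified_header: Optional[str], local_path: str) -> bool:
    """Return True when the remote file looks newer than the cached copy."""
    # Nothing cached yet: always download.
    if not os.path.isfile(local_path):
        return True
    # No header to compare against: download to be safe.
    if not last_modified_header:
        return True

    remote = parsedate_to_datetime(last_modified_header)
    if remote.tzinfo is None:
        # Treat a naive timestamp as UTC so the comparison below is valid.
        remote = remote.replace(tzinfo=timezone.utc)

    local = datetime.fromtimestamp(os.path.getmtime(local_path), tz=timezone.utc)
    return remote > local
```

If the host publishes a new competition dataset, the remote timestamp moves past the local one and the cached file is replaced instead of being returned stale.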

made changes for both.

src/kagglehub/competition.py

rosbo · 2024-09-23T23:07:47Z

integration_tests/test_competition_download.py

+                "test.csv",
+                "train.csv",
+            ]
+            for p in file_paths:


What error is shown to the user if they haven't accepted the competition rules?

authentication error: 403

i've updated tests to better capture the error that's supposed to be given.

Ideally, we would show a better error message so the user can understand why they can't download the dataset and what they must do to get access.

currently the error message is identical to that of dataset and model:
https://screenshot.googleplex.com/47Ci23jVYpb6uYK

"You don't have permission to access resource at URL: https://www.kaggle.com/competitions/m5-forecasting-accuracy
Please make sure you are authenticated if you are trying to access a private resource or a resource requiring consent."

would we want to add an additional:
"please accept competition rule" something along those lines?

Yes, and maybe with a link to where to go to accept the rules: https://www.kaggle.com/competitions/m5-forecasting-accuracy/rules

And of course, we should only show the message about accepting rules for competition datasets and not other types of resources.
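One way the suggestion above could look, as a sketch only. The two handle classes here are simplified stand-ins written for this example (not kagglehub's real classes), and `permission_error_message` plus the exact wording of the extra hint are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CompetitionHandle:
    """Stand-in handle for a competition, identified by its slug."""
    competition: str

    def to_url(self) -> str:
        return f"https://www.kaggle.com/competitions/{self.competition}"


@dataclass(frozen=True)
class DatasetHandle:
    """Stand-in handle for a dataset, identified by owner and name."""
    owner: str
    dataset: str

    def to_url(self) -> str:
        return f"https://www.kaggle.com/datasets/{self.owner}/{self.dataset}"


def permission_error_message(handle) -> str:
    """Build the 403 message; add the rules hint for competitions only."""
    message = (
        f"You don't have permission to access resource at URL: {handle.to_url()}\n"
        "Please make sure you are authenticated if you are trying to access "
        "a private resource or a resource requiring consent."
    )
    if isinstance(handle, CompetitionHandle):
        # Only competitions have rules to accept, so other handles skip this.
        message += (
            "\nPlease make sure you have accepted the competition rules: "
            f"{handle.to_url()}/rules"
        )
    return message
```

Branching on the handle type keeps the dataset and model messages unchanged while pointing competition users at the rules page.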

src/kagglehub/handle.py

calderjo · 2024-09-27T00:48:15Z

vincent is ooo until oct 7,

adding jim, and adding jon-w @jeward414 (optional, as they worked on dataset version of this) as reviewers

jeward414

Looks good! Just a few nits

tests/test_handle.py

tests/test_http_competition_download.py

jplotts

Overall LGTM, nice work!

jplotts · 2024-09-27T14:37:12Z

src/kagglehub/clients.py

@@ -143,6 +158,11 @@ def download_file(self, path: str, out_file: str, resource_handle: Optional[Reso
            total_size = int(response.headers["Content-Length"])
            size_read = 0

+            if isinstance(resource_handle, CompetitionHandle) and not _download_needed(


Why is this something we only do for competitions? It seems like returning early would be good for all downloads, right?

Datasets and models return early by checking and caching version numbers, which they get from the Kaggle API.

Competitions don't have such a handler, since we only allow users to get the latest version, so we took this approach of checking the "last mod. date" instead.

we could extend it to other types, but that might be overkill, since we would expect those requests to have returned before this function is called.

src/kagglehub/exceptions.py

calderjo changed the title ~~stuff~~ Add Competition Download. Sep 20, 2024

calderjo changed the title ~~Add Competition Download.~~ Add Competition Download method Sep 20, 2024

calderjo changed the title ~~Add Competition Download method~~ Add Competition Download method pt 1 Sep 20, 2024

calderjo added 10 commits September 23, 2024 20:57

stuff

9f23132

ssss

31eed5f

asasdfasdf

a759ea8

work it works

3397b40

cleaned up

f41255f

lint 1

4e7004e

ipppp

14a7d81

unit test complete

675d5e4

int test

f131922

typo

7400900

calderjo force-pushed the competition_download branch from 94a6150 to 7400900 Compare September 23, 2024 20:58

calderjo requested a review from rosbo September 23, 2024 21:11

rosbo reviewed Sep 23, 2024


reviewer feedback pt 1

64ef30c

calderjo requested a review from rosbo September 24, 2024 22:26

calderjo added 3 commits September 27, 2024 00:28

reviwer feedback pt2

ae502c4

format lol

714fac2

lint 22

f3c9580

calderjo requested a review from jplotts September 27, 2024 00:45

jeward414 approved these changes Sep 27, 2024


tests/test_handle.py (outdated)

tests/test_http_competition_download.py (outdated)

jplotts approved these changes Sep 27, 2024


reviewer feedback 3

4d66e71

calderjo merged commit b05b638 into main Sep 27, 2024
6 checks passed

calderjo deleted the competition_download branch September 27, 2024 16:52

calderjo changed the title ~~Add Competition Download method pt 1~~ Add Competition Download Oct 1, 2024
