Add Competition Download #158

Merged
merged 15 commits into from
Sep 27, 2024

Conversation

calderjo
Contributor

@calderjo calderjo commented Sep 16, 2024

This is part 1 of 2 for downloading competition datasets via kagglehub.
Higher-level view

competition_download takes in:

  • competition slug (which is unique)
  • path (to perform single-file downloads; similar to dataset download)
  • force download (similar to dataset download)

The Kaggle API (and this method as well) only lets users download the latest competition dataset, which is intentional.

A follow-up PR will implement the resolver to attach a competition dataset inside a Kaggle notebook.
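
As a rough illustration of the three inputs listed above, here is a toy model of how slug, path, and force download could interact with a cache. It is a sketch only: `competition_download_sketch`, `fake_fetch`, and the in-memory `_cache` are hypothetical stand-ins, not the code in this PR.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical stand-in for the remote service: returns {file name: bytes}.
Fetcher = Callable[[str, Optional[str]], Dict[str, bytes]]

# In-memory cache keyed by (slug, path). There is no version component,
# because only the latest competition data can be downloaded.
_cache: Dict[Tuple[str, Optional[str]], Dict[str, bytes]] = {}


def competition_download_sketch(
    slug: str,
    fetch: Fetcher,
    path: Optional[str] = None,
    force_download: bool = False,
) -> Dict[str, bytes]:
    """Toy model of the three inputs; not kagglehub's real implementation."""
    key = (slug, path)
    if force_download or key not in _cache:
        _cache[key] = fetch(slug, path)
    return _cache[key]


calls: List[Tuple[str, Optional[str]]] = []


def fake_fetch(slug: str, path: Optional[str]) -> Dict[str, bytes]:
    """Pretend remote: records each call and serves two small files."""
    calls.append((slug, path))
    files = {"train.csv": b"a,b\n", "test.csv": b"a\n"}
    return files if path is None else {path: files[path]}
```

Calling the sketch twice with the same slug hits the fake remote once; passing `force_download=True` triggers another fetch.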

https://b.corp.google.com/issues/369206113

@calderjo calderjo changed the title stuff Add Competition Download. Sep 20, 2024
@calderjo calderjo changed the title Add Competition Download. Add Competition Download method Sep 20, 2024
@calderjo calderjo changed the title Add Competition Download method Add Competition Download method pt 1 Sep 20, 2024
@calderjo calderjo force-pushed the competition_download branch from 94a6150 to 7400900 Compare September 23, 2024 20:58
@calderjo calderjo requested a review from rosbo September 23, 2024 21:11
@calderjo
Contributor Author

There's an issue with the integration tester uploading datasets with an already-taken slug. I don't think it's related to the changes I made.

{'error': 'The requested title "dataset-513096b0-574f-4ede-8e64-cd4b18687dec" is already in use by a dataset. Please c...e64-cd4b18687dec" is already in use by a dataset. Please choose another title.'

Contributor

@rosbo rosbo left a comment

Great job. A few suggestions / questions.

a follow up PR, will add logic to mount competition to inputs for interactive kaggle

I assume you are referring to the resolver to attach a competition dataset inside a Kaggle notebook?

return os.path.join(
get_cache_folder(),
COMPETITIONS_CACHE_SUBFOLDER,
COMPETITIONS_INDIVIDUAL_FILE_MARKER_FOLDER,
Contributor

Why use a different marker than for datasets & models?

Contributor Author

Yeah, it gets a bit weird when testing.

Since competition file paths are shallow compared to dataset and model paths, i.e. "competition/comp-name/file" vs. "datasets/jcchavez/dataset-name/version/1/file", there isn't a good place to put markers where they won't conflict with the "ls" command, causing tests to fail.

Conflict in this case means "ls" will read in the marker along with the downloaded files, i.e.
[test.csv.complete, test.csv], where we would instead want just [test.csv].

I could make the tests exclude the markers, but I think users don't want to adjust their commands to avoid markers. So I opted to move markers into their own folder, and read from that folder for caching.
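
To make the idea concrete, here is a minimal sketch of keeping completion markers in a sibling folder so that listing the competition folder returns only data files. The folder names and the helpers `data_path`, `marker_path`, `save_file`, and `is_cached` are assumptions for illustration, not the constants or functions used in this PR.

```python
import os
import tempfile

COMPETITIONS_FOLDER = "competitions"  # hypothetical name
MARKER_FOLDER = ".complete"           # hypothetical name


def data_path(cache_root: str, slug: str, filename: str) -> str:
    return os.path.join(cache_root, COMPETITIONS_FOLDER, slug, filename)


def marker_path(cache_root: str, slug: str, filename: str) -> str:
    # Markers live beside the competition folders, never inside them.
    return os.path.join(
        cache_root, COMPETITIONS_FOLDER, MARKER_FOLDER, slug, filename + ".complete"
    )


def save_file(cache_root: str, slug: str, filename: str, content: bytes) -> None:
    target = data_path(cache_root, slug, filename)
    marker = marker_path(cache_root, slug, filename)
    os.makedirs(os.path.dirname(target), exist_ok=True)
    os.makedirs(os.path.dirname(marker), exist_ok=True)
    with open(target, "wb") as f:
        f.write(content)
    # Write the marker last so an interrupted download is never seen as complete.
    open(marker, "w").close()


def is_cached(cache_root: str, slug: str, filename: str) -> bool:
    return os.path.exists(marker_path(cache_root, slug, filename))


with tempfile.TemporaryDirectory() as root:
    save_file(root, "titanic", "test.csv", b"id\n1\n")
    listing = os.listdir(os.path.join(root, COMPETITIONS_FOLDER, "titanic"))
    cached = is_cached(root, "titanic", "test.csv")
    missing = is_cached(root, "titanic", "train.csv")
```

With this layout, `listing` contains only the downloaded file, while the cache check still works through the marker folder.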

Contributor

Why wouldn't this work?

competitions/titanic/train.csv
competitions/.complete/titanic/titanic.csv.complete

Also, you will need to track version somehow in the cache because if a new competition dataset is released by the host, you want to download it. If you don't track the version, it would simply return the "stale" files from your local cache.

Contributor Author

Did some looking; it seems the Kaggle API does this by checking the "last modified" timestamp and comparing.
Can we go with this approach?

https://github.com/Kaggle/kaggle-api/blob/ded7a528499d5db351f94c867a2441c1101f4ad0/kaggle/api/kaggle_api_extended.py#L3778-L3810

If you're referring to using something like the "ids" we use for competitions datasets, I don't think we have a method that directly exposes those details. The closest I found was the signed URL (which has the id) that gets returned in the response; we could parse that and cache it for later comparisons.

I might be overthinking this; maybe there's a much simpler way...

Contributor

Makes sense to base this on "last modified".
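
A minimal sketch of a "last modified" check, assuming the server sends a standard HTTP `Last-Modified` header and that the local file's mtime reflects when it was downloaded. The function name `download_needed_sketch` and the plain-dict headers (case-sensitive here, unlike real HTTP header objects) are illustrative assumptions, not the PR's actual `_download_needed`.

```python
import os
import tempfile
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Mapping


def download_needed_sketch(headers: Mapping[str, str], local_file: str) -> bool:
    """Return True when the local copy is missing or older than the remote one."""
    if not os.path.exists(local_file):
        return True
    last_modified = headers.get("Last-Modified")
    if last_modified is None:
        return True  # freshness cannot be proven, so download again
    remote_time = parsedate_to_datetime(last_modified)
    if remote_time.tzinfo is None:
        # Treat a naive timestamp as UTC so the comparison below is valid.
        remote_time = remote_time.replace(tzinfo=timezone.utc)
    local_time = datetime.fromtimestamp(os.path.getmtime(local_file), tz=timezone.utc)
    return remote_time > local_time


with tempfile.TemporaryDirectory() as d:
    local = os.path.join(d, "train.csv")
    newer = {"Last-Modified": "Fri, 27 Sep 2024 00:00:00 GMT"}
    older = {"Last-Modified": "Thu, 01 Aug 2024 00:00:00 GMT"}
    before_download = download_needed_sketch(newer, local)
    open(local, "w").close()
    # Pin the local mtime to 1 Sep 2024 so the comparison is deterministic.
    stamp = datetime(2024, 9, 1, tzinfo=timezone.utc).timestamp()
    os.utime(local, (stamp, stamp))
    unchanged = download_needed_sketch(older, local)
    updated = download_needed_sketch(newer, local)
```

The file is re-downloaded only when it is absent or the host has published newer data, which avoids returning stale files from the local cache.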

Contributor Author

made changes for both.

"test.csv",
"train.csv",
]
for p in file_paths:
Contributor

What error is shown to the user if they haven't accepted the competition rules?

Contributor Author

authentication error: 403

I've updated the tests to better capture the error that's supposed to be given.

Contributor

Ideally, we would show a better error message so the user can understand why they can't download the dataset and what they must do to get access.

Contributor Author

Currently the error message is identical to that of datasets and models:
https://screenshot.googleplex.com/47Ci23jVYpb6uYK

"You don't have permission to access resource at URL: https://www.kaggle.com/competitions/m5-forecasting-accuracy
Please make sure you are authenticated if you are trying to access a private resource or a resource requiring consent."

Would we want to add an additional message,
"please accept the competition rules" or something along those lines?

Contributor

Yes, and maybe with a link to where to go to accept the rules: https://www.kaggle.com/competitions/m5-forecasting-accuracy/rules

And of course, we should only show the message about accepting rules for competition datasets and not other types of resources.
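
One possible shape for this, as a sketch only: build the existing 403 message and append the rules hint when the resource is a competition. The function `permission_error_message` and its `competition_slug` parameter are hypothetical; the wording of the extra line is an assumption, and the rules URL pattern follows the link above.

```python
from typing import Optional


def permission_error_message(
    resource_url: str, competition_slug: Optional[str] = None
) -> str:
    """Build a 403 message; the rules hint is added for competitions only."""
    message = (
        f"You don't have permission to access resource at URL: {resource_url}\n"
        "Please make sure you are authenticated if you are trying to access a "
        "private resource or a resource requiring consent."
    )
    if competition_slug is not None:
        rules_url = f"https://www.kaggle.com/competitions/{competition_slug}/rules"
        message += (
            f"\nPlease make sure you have accepted the competition rules: {rules_url}"
        )
    return message


competition_msg = permission_error_message(
    "https://www.kaggle.com/competitions/m5-forecasting-accuracy",
    competition_slug="m5-forecasting-accuracy",
)
dataset_msg = permission_error_message("https://www.kaggle.com/datasets/owner/name")
```

Datasets and models keep the existing message unchanged, since `competition_slug` is left as `None` for them.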

Contributor Author

d.

@calderjo calderjo requested a review from rosbo September 24, 2024 22:26
@calderjo calderjo requested a review from jplotts September 27, 2024 00:45
@calderjo
Contributor Author

calderjo commented Sep 27, 2024

Vincent is OOO until Oct 7,

so adding Jim, and adding jon-w @jeward414 (optional, as they worked on the dataset version of this) as reviewers.

Contributor

@jeward414 jeward414 left a comment

Looks good! Just a few nits

Contributor

@jplotts jplotts left a comment

Overall LGTM, nice work!

@@ -143,6 +158,11 @@ def download_file(self, path: str, out_file: str, resource_handle: Optional[Reso
total_size = int(response.headers["Content-Length"])
size_read = 0

if isinstance(resource_handle, CompetitionHandle) and not _download_needed(
Contributor

Why is this something we only do for competitions? It seems like returning early would be good for all downloads, right?

Contributor Author

@calderjo calderjo Sep 27, 2024

Datasets and models return early by checking and caching version numbers, which they get from the Kaggle API.

Competitions don't have such a handler, since we allow users to get only the latest version, so we took the approach of checking the "last modified" date instead.

We could extend it to other types, but that might be overkill, since we would expect those requests to have returned before this function is called.

@calderjo calderjo merged commit b05b638 into main Sep 27, 2024
6 checks passed
@calderjo calderjo deleted the competition_download branch September 27, 2024 16:52
@calderjo calderjo changed the title Add Competition Download method pt 1 Add Competition Download Oct 1, 2024