-
Notifications
You must be signed in to change notification settings - Fork 36
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Competition Download #158
Conversation
94a6150
to
7400900
Compare
issue with the integration tester uploading datasets with taken slug, don't think it's related the changes i made, i think {'error': 'The requested title "dataset-513096b0-574f-4ede-8e64-cd4b18687dec" is already in use by a dataset. Please c...e64-cd4b18687dec" is already in use by a dataset. Please choose another title.' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great job. A few suggestions / questions.
a follow up PR, will add logic to mount competition to inputs for interactive kaggle
I assume your are referring to the resolver to attach a competition dataset inside a Kaggle notebook?
src/kagglehub/cache.py
Outdated
return os.path.join( | ||
get_cache_folder(), | ||
COMPETITIONS_CACHE_SUBFOLDER, | ||
COMPETITIONS_INDIVIDUAL_FILE_MARKER_FOLDER, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why using a different marker than datasets & models?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah it gets a bit weird when testing.
since competition file paths are shallow compared to dataset and model i.e "competition/comp-name/file" vs "datasets/jcchavez/dataset-name/version/1/file" , there isn't a good place to put markers where it won't conflict with the "ls" command, causing tests to fail.
conflict in this case meaning "ls" will read-in the marker along with the downloaded files i.e.
[test.csv.complete, test.csv], where we would instead want just [test.csv].
I could make tests execute the markers, but i think users don't want to adjust their commands to avoid markers. so i opted to move markers into their own folder, and read from that folder for caching.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
example where we check file contents
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why this wouldn't work?
competitions/titanic/train.csv
competitions/.complete/titanic/titanic.csv.complete
Also, you will need to track version somehow in the cache because if a new competition dataset is released by the host, you want to download it. If you don't track the version, it would simply return the "stale" files from your local cache.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
did some looking, seems like kaggle api does this by checking the "last modified" timestamp and comparing.
can we go by this approach?
if your referring to using like "ids" like we use for competitions datasets, i don't think we have a method that directly exposes those details. closest i found was the signed Url (which has the id) that get returned in the response, that we could parse that and cache that for later comparisons.
i might be overthinking this, maybe there way simpler way...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense to base this on "last modified".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
made changes for both.
"test.csv", | ||
"train.csv", | ||
] | ||
for p in file_paths: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What error is shown to the user if they haven't accepted the competition rules?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
authentication error: 403
i've updated tests to better capture the error that supposed to be given.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally, we would show a better error message so the user can understand why they can't download the dataset and what they must do to get access.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
currently the error message is identical to that of dataset and model:
https://screenshot.googleplex.com/47Ci23jVYpb6uYK
"You don't have permission to access resource at URL: https://www.kaggle.com/competitions/m5-forecasting-accuracy
Please make sure you are authenticated if you are trying to access a private resource or a resource requiring consent."
would we want to add an additional:
"please accept competition rule" something along those lines?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, and maybe with a link to where to go to accept the rules: https://www.kaggle.com/competitions/m5-forecasting-accuracy/rules
And of course, we should only show the message about accepting rules for competition datasets and not other types of resources.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
d.
vincent is ooo until oct 7, adding jim, and adding jon-w @jeward414 (optional, as they worked on dataset version of this) as reviewers |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! Just a few nits
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall LGTM, nice work!
@@ -143,6 +158,11 @@ def download_file(self, path: str, out_file: str, resource_handle: Optional[Reso | |||
total_size = int(response.headers["Content-Length"]) | |||
size_read = 0 | |||
|
|||
if isinstance(resource_handle, CompetitionHandle) and not _download_needed( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this something we only do for competitions? It seems like returning early would be good for all downloads, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Datasets and models returns early by checking and caching version numbers, which they get from kaggle api.
Competitions doesn't have such handler, since we allow users to only get the latest version, so we took this approach of checking the "last mod. date" instead.
we could extend it for other types, but might be overkill, since we would expect those requests to have returned before this function is called.
This is part 1 of 2 for download competition dataset via kagglehub.
higher level view
competition_download takes in
kaggle api (and this method as well) only let's users download latest competition dataset, which is intentional.
a follow up PR, will implement the resolver to attach a competition dataset inside a Kaggle notebook
https://b.corp.google.com/issues/369206113