Notebook Output Download versioning #206
Looks good so far, thanks Jonathan!
src/kagglehub/http_resolver.py
Outdated
@@ -330,6 +342,9 @@ def _build_list_model_instance_version_files_url_path(h: ModelHandle) -> str:
def _build_get_dataset_url_path(h: DatasetHandle) -> str:
    return f"datasets/view/{h.owner}/{h.dataset}"


def _build_get_notebook_url_path(h: NotebookHandle) -> str:
    return f"kernels/output/download/{h.owner}/{h.notebook}?version_number={h.version}"
nit: Ack that you're copying existing datasets code, but this is slightly fragile in that it silently assumes the optional field h.version is set, which is true for now but could lead to tricky bugs later. We could put a check + raise here (and for datasets) to make that assumption explicit. Or make the code robust enough to handle an unset h.version.
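A minimal sketch of the "check + raise" option, for illustration only. The NotebookHandle dataclass below is a hypothetical stand-in with just the fields the diff uses, and the choice of ValueError and the message wording are assumptions, not the project's actual code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NotebookHandle:
    # Hypothetical stand-in for the real handle; only the fields used in the diff.
    owner: str
    notebook: str
    version: Optional[int] = None


def build_get_notebook_url_path(h: NotebookHandle) -> str:
    # Make the "version is already resolved" assumption explicit instead of
    # silently formatting "None" into the query string.
    if h.version is None:
        msg = f"No version set for notebook {h.owner}/{h.notebook}"
        raise ValueError(msg)  # assumed exception type
    return f"kernels/output/download/{h.owner}/{h.notebook}?version_number={h.version}"
```

The same guard could be applied to the datasets URL builder if it relies on the same assumption.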
It is always set: if a specific version isn't provided, the code checks what the current version is and uses that. However, I do understand that it could get trickier down the line.
src/kagglehub/clients.py
Outdated
total_size = 0
if not isinstance(resource_handle, NotebookHandle):
    total_size = int(response.headers["Content-Length"])
While it seems to work, this approach feels a bit fragile to me. Also, I think _download_file might behave strangely, e.g. rendering the tqdm progress bar with total=0. IMO it'd be stronger to build more generalized handling that supports both scenarios. So we'd skip the NotebookHandle check and set total_size = int(response.headers["Content-Length"]) if "Content-Length" in response.headers else None (or similar). Then we'd add an if _is_resumable(response) and total_size to current line 176 to skip the resuming logic, which currently wants (though may not need) total_size (and the kernel output zip file fails the _is_resumable check anyway). And then update _download_file to be a bit more graceful with the total_size is None case, probably just skipping tqdm entirely?
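A rough sketch of the generalized handling suggested above, using only the standard library. The function names get_total_size, should_resume, and write_chunks, and the on_progress callback, are hypothetical and stand in for the real client code and the tqdm progress bar. A plain mapping is used for the headers here, so the lookup is case-sensitive, unlike a real HTTP response's headers.

```python
import io
from typing import BinaryIO, Callable, Iterable, Mapping, Optional


def get_total_size(headers: Mapping[str, str]) -> Optional[int]:
    # Return Content-Length as an int, or None when the header is absent.
    value = headers.get("Content-Length")
    return int(value) if value is not None else None


def should_resume(is_resumable: bool, total_size: Optional[int]) -> bool:
    # Mirrors "if _is_resumable(response) and total_size": skip the resuming
    # logic when the size is unknown (None) or zero.
    return is_resumable and bool(total_size)


def write_chunks(
    chunks: Iterable[bytes],
    out_file: BinaryIO,
    total_size: Optional[int],
    on_progress: Optional[Callable[[int, int], None]] = None,
) -> int:
    # Write every chunk; report progress only when the total size is known.
    written = 0
    for chunk in chunks:
        out_file.write(chunk)
        written += len(chunk)
        if total_size is not None and on_progress is not None:
            on_progress(written, total_size)
    return written


if __name__ == "__main__":
    buffer = io.BytesIO()
    size = get_total_size({})  # no Content-Length, as for the notebook output zip
    write_chunks([b"abc", b"de"], buffer, size)
    print(size, buffer.getvalue())
```

With this shape, the notebook download needs no isinstance check: a missing header yields None, resuming is skipped, and progress reporting is simply not invoked.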
api_client.download_file(url_path, archive_path, h)

# Create the directory to extract the archive to.
os.makedirs(out_path, exist_ok=True)
nit: Might be nicer if _extract_archive handled this internally... but again fine to punt :)
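One way this could look, as a sketch only. The extract_archive function below is a hypothetical zip-only helper and not the project's actual _extract_archive, whose signature and supported archive formats are not shown in this thread.

```python
import os
import zipfile


def extract_archive(archive_path: str, out_path: str) -> None:
    # Creating the destination here means callers no longer need their own
    # os.makedirs call before extracting.
    os.makedirs(out_path, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(out_path)
```

The call site would then reduce to the download followed by a single extract call.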
b/384702998
also enables the unit test checks for the kaggle cache resolver -> b/381119949