feat(code): add function to download notebook outputs #184

Merged
KeijiBranshi merged 1 commit into main from keijibranshi/feat/code/download-output
Dec 2, 2024

Conversation

KeijiBranshi
Member

@KeijiBranshi KeijiBranshi commented Nov 22, 2024

BUG=b/371574828
CHILD=#185
BLOCKED_BY=go/kaggle-pr/32581,go/kaggle-pr/32632
CC=@rosbo,@dster2,@jplotts

Extends the download functionality already provided by models.py, datasets.py, and competition.py to notebook outputs at https://kaggle.com/code.

Changes

handle.py

cache.py

  • added new functions to dictate the cached path for notebook outputs based on the properties in CodeHandle
  • this mostly mirrors the same structure as the model, dataset, and competition paths.
    • the cache structure is split into the output_path, archive_path, a _completion_file_marker_path for individual files in the download payload, and a _completion_file_marker_path for the entire download payload.
  • the structure is as follows:
    <cache_root>/
    └── notebooks/
        └── username/
            └── notebook_slug/
                ├── output.complete  <-- tracker for the entire output
                ├── .complete/       <-- per-file trackers within the output
                │   └── output/
                │       ├── file1.txt.complete
                │       └── file2.txt.complete
                ├── output.archive   <-- the compressed output (.tar.gz or .zip)
                └── output/          <-- the uncompressed output
                    ├── file1.txt
                    └── file2.txt
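
The layout above can be sketched as a handful of path helpers. This is a minimal illustration only: the `CodeHandle` fields (`owner`, `notebook_slug`) and every function name below are assumptions made for the example, not the names used in cache.py.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class CodeHandle:
    # Hypothetical stand-in for the handle in handle.py; field names are assumed.
    owner: str
    notebook_slug: str


def notebook_root(cache_root: Path, handle: CodeHandle) -> Path:
    # <cache_root>/notebooks/<username>/<notebook_slug>
    return cache_root / "notebooks" / handle.owner / handle.notebook_slug


def output_path(cache_root: Path, handle: CodeHandle) -> Path:
    # Directory holding the uncompressed output files.
    return notebook_root(cache_root, handle) / "output"


def archive_path(cache_root: Path, handle: CodeHandle) -> Path:
    # Location of the compressed output (.tar.gz or .zip).
    return notebook_root(cache_root, handle) / "output.archive"


def output_marker_path(cache_root: Path, handle: CodeHandle) -> Path:
    # Marker recording that the entire output finished downloading.
    return notebook_root(cache_root, handle) / "output.complete"


def file_marker_path(cache_root: Path, handle: CodeHandle, filename: str) -> Path:
    # Marker recording that a single file within the output finished downloading.
    return notebook_root(cache_root, handle) / ".complete" / "output" / f"{filename}.complete"
```

Keeping the markers outside `output/` means the directory handed back to the caller contains only the notebook's own files.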
    

http_resolver.py

  • Implemented the NotebookOutputHttpResolver
  • Note, we don't currently have an API endpoint to download notebook output in a kagglehub-compatible compression format (left a TODO with our internal tracker for this)
  • It leverages our existing KaggleApiV1Client + the new cache location mentioned above
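
A rough sketch of how a resolver could combine a download step with the completion marker from the cache layout. It is not the code in http_resolver.py: `resolve_output`, its parameters, and the injected `download` callback (standing in for whatever the API client does) are all assumptions for illustration.

```python
from pathlib import Path
from typing import Callable


def resolve_output(
    notebook_root: Path,
    download: Callable[[Path], None],
    force_download: bool = False,
) -> Path:
    """Return the cached output directory, downloading only when needed.

    `download` is a hypothetical callback that writes the notebook's
    output files into the directory it is given.
    """
    out_dir = notebook_root / "output"
    marker = notebook_root / "output.complete"

    # Cache hit: a previous download ran to completion.
    if marker.exists() and not force_download:
        return out_dir

    # Drop any stale marker so an interrupted download is never
    # mistaken for a complete one.
    marker.unlink(missing_ok=True)
    out_dir.mkdir(parents=True, exist_ok=True)

    download(out_dir)

    # Only mark the output complete after the download succeeded.
    marker.touch()
    return out_dir
```

Writing the marker last is the point of the pattern: if the download raises, no marker exists and the next call retries.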

registry.py + __init__.py

  • bootstraps the NotebookOutputHttpResolver so that it can be called by kagglehub.notebook_output_download in code.py
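
For readers unfamiliar with the pattern, here is a minimal sketch of a resolver registry. The class and method names (`Resolver`, `ResolverRegistry`, `add_implementation`, `is_supported`) and the demo resolver are assumptions for this example and may not match registry.py.

```python
from typing import List, Optional


class Resolver:
    """Hypothetical base interface for anything that can resolve a handle."""

    def is_supported(self, handle: str) -> bool:
        raise NotImplementedError

    def __call__(self, handle: str, path: Optional[str] = None) -> str:
        raise NotImplementedError


class ResolverRegistry:
    """Holds resolver implementations and dispatches to the first that applies."""

    def __init__(self) -> None:
        self._impls: List[Resolver] = []

    def add_implementation(self, impl: Resolver) -> None:
        self._impls.append(impl)

    def __call__(self, handle: str, path: Optional[str] = None) -> str:
        # Most recently registered implementations take priority.
        for impl in reversed(self._impls):
            if impl.is_supported(handle):
                return impl(handle, path)
        raise RuntimeError(f"No resolver supports handle: {handle!r}")


class DemoNotebookOutputResolver(Resolver):
    """Toy resolver: accepts 'owner/slug' handles and returns a fake location."""

    def is_supported(self, handle: str) -> bool:
        return handle.count("/") == 1

    def __call__(self, handle: str, path: Optional[str] = None) -> str:
        return f"notebook-output:{handle}"


# Bootstrapping: register the resolver once so an entry point can call the registry.
notebook_output_resolver = ResolverRegistry()
notebook_output_resolver.add_implementation(DemoNotebookOutputResolver())
```

With this shape, an entry point only needs to call the registry with a handle and never has to know which resolver ends up serving it.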

code.py

  • the entry point to the notebook output downloading functionality
  • the file is named code.py to align with our navigation paths at https://kaggle.com (similar to models, datasets, and competitions). Open to changing if needed.
  • the function is named notebook_output_download to be more specific about what's being downloaded

test_notebook_output_download.py

gcs_upload.py

  • A miscellaneous lint error that slipped through. Fixing here as a drive-by change as per this comment

@KeijiBranshi KeijiBranshi force-pushed the keijibranshi/feat/code/download-output branch from 9bf8815 to 3c297d6 Compare November 22, 2024 18:16
@KeijiBranshi

This comment was marked as resolved.

@rosbo

This comment was marked as resolved.

rosbo

This comment was marked as resolved.

@KeijiBranshi KeijiBranshi force-pushed the keijibranshi/feat/code/download-output branch 2 times, most recently from 7b53aa3 to 0cd1661 Compare November 23, 2024 00:35
@KeijiBranshi

This comment was marked as resolved.

@KeijiBranshi KeijiBranshi force-pushed the keijibranshi/feat/code/download-output branch 2 times, most recently from 6f10385 to d0e5114 Compare November 23, 2024 01:27
@KeijiBranshi

This comment was marked as resolved.

@KeijiBranshi KeijiBranshi force-pushed the keijibranshi/feat/code/download-output branch 9 times, most recently from f9a66b0 to 6eb6af7 Compare November 27, 2024 04:22
@KeijiBranshi KeijiBranshi force-pushed the keijibranshi/feat/code/download-output branch from 28ea68f to 43197c4 Compare November 27, 2024 04:44
@@ -10,7 +10,7 @@
 from typing import Optional, Union

 import requests
-from requests.exceptions import ConnectionError, Timeout
+from requests.exceptions import Timeout
Member Author

note: these are miscellaneous drive-by changes as per #issuecomment-2494710160

@KeijiBranshi KeijiBranshi marked this pull request as ready for review November 27, 2024 04:50
@KeijiBranshi
Member Author

fyi @calderjo going to merge now to unblock. lmk if you have any comments though and I can address in a follow up iteration

@KeijiBranshi KeijiBranshi merged commit ffa9ff7 into main Dec 2, 2024
6 checks passed
@KeijiBranshi KeijiBranshi deleted the keijibranshi/feat/code/download-output branch December 2, 2024 19:16
3 participants