@SiqiChen9
Contributor

Description

This PR fixes issue #754 where the Kaggle API would exhaust system memory when downloading large datasets.

Problem

The HTTP client was not using stream=True when making requests for file downloads. This caused the entire file content to be loaded into memory before being written to disk, making the system unstable when downloading large datasets.
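To illustrate the memory behaviour described above, here is a minimal sketch of chunked writing. It is not the SDK's code: `write_in_chunks`, `StreamedResponse`, and `FakeResponse` are hypothetical names introduced for this example, and the only real API it assumes is the `iter_content(chunk_size=...)` method of `requests.Response`.

```python
import io
from typing import BinaryIO, Iterator, Protocol


class StreamedResponse(Protocol):
    """The one method of requests.Response that this sketch relies on."""

    def iter_content(self, chunk_size: int = 1) -> Iterator[bytes]: ...


def write_in_chunks(response: StreamedResponse, out: BinaryIO,
                    chunk_size: int = 1024 * 1024) -> int:
    """Copy the response body to `out` one chunk at a time; return bytes written."""
    written = 0
    for chunk in response.iter_content(chunk_size=chunk_size):
        out.write(chunk)
        written += len(chunk)
    return written


class FakeResponse:
    """Hypothetical stand-in so the sketch runs without a network call."""

    def __init__(self, body: bytes) -> None:
        self._body = io.BytesIO(body)

    def iter_content(self, chunk_size: int = 1) -> Iterator[bytes]:
        # Yield successive slices of the body until it is exhausted.
        while chunk := self._body.read(chunk_size):
            yield chunk
```

With a real request, the response would come from something like `requests.get(url, stream=True)`. Without `stream=True`, requests reads the whole body before returning the response, so iterating in chunks afterwards does not bound peak memory.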

Solution

  • Modified KaggleHttpClient.call() to detect file download response types (FileDownload and HttpRedirect)
  • Automatically enable streaming (stream=True) for these response types
  • The existing download_file() method already uses response.iter_content() for chunked reading, which now works properly with streaming enabled
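The detection step above could look roughly like the following. This is a sketch under assumptions, not the merged change: the three classes are hypothetical stand-ins for the SDK's response types, and `request_settings` is a name invented here for illustration.

```python
from typing import Any, Dict, Type


class FileDownload:
    """Hypothetical stand-in for the SDK's file-download response type."""


class HttpRedirect:
    """Hypothetical stand-in for the SDK's redirect response type."""


class ApiStatus:
    """Hypothetical ordinary (non-download) response type."""


# Response types whose bodies should be streamed rather than buffered.
STREAMED_RESPONSE_TYPES = (FileDownload, HttpRedirect)


def request_settings(response_type: Type[Any], base: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `base` with stream=True added for download responses."""
    settings = dict(base)  # copy so the caller's dict is not mutated
    if isinstance(response_type, type) and issubclass(response_type, STREAMED_RESPONSE_TYPES):
        settings["stream"] = True
    return settings
```

The resulting dictionary could then be passed as keyword arguments to `requests.Session.send()`, which accepts `stream`. Keeping the decision in one place means other response types continue to be buffered as before.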

Changes

  • src/kagglesdk/kaggle_http_client.py

    • Added imports for FileDownload and HttpRedirect types
    • Added logic to set stream=True in request settings for file downloads
  • pyproject.toml

    • Removed kagglesdk from dependencies list
    • Reason: kagglesdk source code is in src/kagglesdk/ and should use local code during editable install, not pull from PyPI

Impact

This fix improves memory usage for:

  • Competition file downloads (competition_download_file, competition_download_files)
  • Dataset downloads (dataset_download_file, dataset_download_files)
  • Model downloads (model_instance_version_download)
  • Kernel output downloads (kernels_output)
  • Leaderboard downloads (competition_leaderboard_download)

Testing

@google-cla

google-cla bot commented Nov 24, 2025

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@stevemessick
Contributor

@SiqiChen9 Thanks for the PR! Could you sign the Google CLA so I can merge it?

BTW why did you remove kagglesdk from the .toml file?

Enable streaming for file downloads by passing stream=True to requests.
This prevents loading entire files into memory when downloading datasets,
competitions, models, and kernel outputs.

Fixes Kaggle#754

@SiqiChen9
Contributor Author

@stevemessick Hi! Thanks for the quick response! I've signed the Google CLA.

Regarding the kagglesdk removal from pyproject.toml: you're absolutely right, that change should not be included in this PR. I apologize for the confusion.

I encountered this during my local development setup:

  • When running an editable install (pip install -e .), I got an import error because the kagglesdk version on PyPI was outdated compared to the local src/kagglesdk code
  • I removed it from the dependencies to force the use of the local code, but this was only relevant to my development environment

I've reverted the pyproject.toml change. The PR now only includes the streaming fix to src/kagglesdk/kaggle_http_client.py.

Updated!

Contributor

@stevemessick left a comment

Thanks!

@stevemessick merged commit 6147a8e into Kaggle:main on Nov 24, 2025
1 check passed
Development

Successfully merging this pull request may close these issues.

Shouldn't API client pass stream = True to the requests when downloading datasets?