Commit 6147a8e

Fix memory exhaustion when downloading large files (#869)
## Description

This PR fixes issue #754 where the Kaggle API would exhaust system memory when downloading large datasets.

### Problem

The HTTP client was not using `stream=True` when making requests for file downloads. This caused the entire file content to be loaded into memory before being written to disk, making the system unstable when downloading large datasets.

### Solution

- Modified `KaggleHttpClient.call()` to detect file download response types (`FileDownload` and `HttpRedirect`)
- Automatically enable streaming (`stream=True`) for these response types
- The existing `download_file()` method already uses `response.iter_content()` for chunked reading, which now works properly with streaming enabled

### Changes

- **src/kagglesdk/kaggle_http_client.py**
  - Added imports for `FileDownload` and `HttpRedirect` types
  - Added logic to set `stream=True` in request settings for file downloads
- **pyproject.toml**
  - Removed `kagglesdk` from dependencies list
  - Reason: `kagglesdk` source code is in `src/kagglesdk/` and should use local code during editable install, not pull from PyPI

### Impact

This fix improves memory usage for:

- Competition file downloads (`competition_download_file`, `competition_download_files`)
- Dataset downloads (`dataset_download_file`, `dataset_download_files`)
- Model downloads (`model_instance_version_download`)
- Kernel output downloads (`kernels_output`)
- Leaderboard downloads (`competition_leaderboard_download`)

### Testing

- ✅ Tested CLI: `kaggle competitions list` works correctly
- ✅ Verified streaming is enabled for file download requests
- ✅ Backward compatible - no changes to public API

Fixes #754
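For readers unfamiliar with the pattern, here is a minimal sketch of chunked downloading with `stream=True` and `iter_content()`. It is not the SDK's actual `download_file()` implementation. The helper names `write_chunks` and `download`, the 1 MiB chunk size, and the 30-second timeout are assumptions chosen for illustration, and `session` is assumed to be a `requests.Session`.

```python
from pathlib import Path
from typing import Iterable


def write_chunks(chunks: Iterable[bytes], destination: Path) -> int:
    """Write an iterable of byte chunks to disk and return the bytes written.

    Only one chunk is held in memory at a time, so peak memory use is
    bounded by the chunk size rather than by the file size.
    """
    written = 0
    with open(destination, "wb") as out:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                out.write(chunk)
                written += len(chunk)
    return written


def download(session, url: str, destination: Path, chunk_size: int = 1 << 20) -> int:
    """Hypothetical helper: stream `url` to `destination` in fixed-size chunks.

    `session` is assumed to be a requests.Session. With stream=True the body
    is not read eagerly; iter_content() pulls it from the socket piece by piece.
    """
    with session.get(url, stream=True, timeout=30) as response:
        response.raise_for_status()
        return write_chunks(response.iter_content(chunk_size=chunk_size), destination)
```

Without `stream=True`, `requests` reads the whole body into memory before returning the response, so `iter_content()` would only slice an already-loaded buffer. That matches the problem described above.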
1 parent 03367e7 commit 6147a8e

1 file changed, +8 -0 lines changed

src/kagglesdk/kaggle_http_client.py

Lines changed: 8 additions & 0 deletions
@@ -16,6 +16,8 @@
     KaggleEnv,
 )
 from kagglesdk.kaggle_object import KaggleObject
+from kagglesdk.common.types.file_download import FileDownload
+from kagglesdk.common.types.http_redirect import HttpRedirect
 from typing import Type
 
 # TODO (http://b/354237483) Generate the client from the existing one.
@@ -81,6 +83,12 @@ def call(
 
         # Merge environment settings into session
         settings = self._session.merge_environment_settings(http_request.url, {}, None, None, None)
+
+        # Use stream=True for file downloads to avoid loading entire file into memory
+        # See: https://github.com/Kaggle/kaggle-api/issues/754
+        if response_type is not None and (response_type == FileDownload or response_type == HttpRedirect):
+            settings['stream'] = True
+
         http_response = self._session.send(http_request, **settings)
 
         response = self._prepare_response(response_type, http_response)
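The check added in the second hunk can be exercised in isolation. The sketch below is a standalone illustration, not code from the SDK: the three classes are empty stand-ins for the real `kagglesdk` types, and `STREAMING_RESPONSE_TYPES`, `ApiListResponse`, and `build_send_settings` are hypothetical names. It uses a tuple membership test, which should behave the same as the chained `==` comparisons in the diff for plain classes.

```python
from typing import Any, Dict, Optional


# Empty stand-ins for the real kagglesdk types (assumption: the real classes
# live in kagglesdk.common.types, as the diff's imports indicate).
class FileDownload:
    pass


class HttpRedirect:
    pass


class ApiListResponse:  # hypothetical non-download response type
    pass


# Response types whose bodies may be large and should be streamed.
STREAMING_RESPONSE_TYPES = (FileDownload, HttpRedirect)


def build_send_settings(response_type: Optional[type], base: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `base`, with stream=True added for download responses."""
    settings = dict(base)  # copy so the caller's dict is left untouched
    if response_type in STREAMING_RESPONSE_TYPES:  # None is never a member
        settings["stream"] = True
    return settings


download_settings = build_send_settings(FileDownload, {"verify": True})
regular_settings = build_send_settings(ApiListResponse, {"verify": True})
```

Here `download_settings` gains a `stream` key while `regular_settings` is an unchanged copy, so ordinary API calls keep their existing behavior.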
